I, L. MICHAEL FURNESS, a citizen of the United Kingdom, residing at 2 
Brookside, Exning, Newmarket, United Kingdom, declare that: 

1. I was employed by Incyte Genomics, Inc. (hereinafter "Incyte") as a 
Director of Pharmacogenomics until December 31, 2001. I am currently under contract to be a 
Consultant to Incyte Genomics, Inc. 

2. In 1984, I received a B.Sc.(Hons) in Biomolecular Science (Biophysics 
and Biochemistry) from Portsmouth Polytechnic. 

From 1985 to 1987 I was at the School of Pharmacy in London, United Kingdom, 
during which time I analyzed lipid methyltransferase enzymes using a variety of protein analysis 
methods, including one-dimensional (1D) and two-dimensional (2D) gel electrophoresis, HPLC, 
and a variety of enzymatic assay systems. 

I then worked in the Protein Structure group at the National Institute for Medical 
Research until 1989, setting up core facilities for nucleic acid synthesis and sequencing, as well 
as assisting in programs on protein kinase C inhibitors. 

After a year at Perkin Elmer-Applied Biosystems as a technical specialist, I 
worked at the Imperial Cancer Research Fund from 1990 to 1992 on a Eureka-funded program 
collaborating with Amersham Pharmacia in the United Kingdom and CEPH (Centre d'Etude du 
Polymorphisme Humaine) in Paris, France, to develop novel nucleic acid purification and 
characterization methods. 

In 1992, I moved to Pfizer Central Research in the United Kingdom, where I 
stayed until 1998, initially setting up core DNA sequencing and then a DNA arraying facility for 
gene expression analysis in 1993. My work also included bioinformatics and I was responsible 
for the support of all Pfizer neuroscience programs in the United Kingdom. This then led me 
into carrying out detailed bioinformatics and wet lab work on the sodium channels, including 
antibody generation, Western and Northern analyses, PCR, tissue distribution studies, and 
sequence analyses on novel sequences identified. 

In 1998, I moved to Incyte Genomics, Inc., to the Pharmacogenomics group, to 
look at the application of genomics and proteomics to the pharmaceutical industry. In 1999, I 
was appointed director of the LifeExpress Lead Program, which used microarray and protein 
expression data to identify pharmacologically and toxicologically relevant mechanisms to assist 
in improved drug design and development. 

On December 12, 2001, I founded Nuomics Consulting Ltd., in Exning, U.K., and 
I am currently employed as Managing Director. Nuomics Consulting Ltd. will be providing 
expert technical knowledge and advice to businesses around the areas of genomics, proteomics, 
pharmacogenomics, toxicogenomics and chemogenomics. 

3. I have reviewed the specification of a United States patent application that 
I understand was filed on February 12, 2001 in the names of Samuel T. LaBrie et al., and was 
assigned Serial No. 09/782,390 (hereinafter "the LaBrie '390 application"). Furthermore, I 
understand that this United States patent application was a divisional application of and claimed 
priority to United States patent application Serial No. 08/812,824 filed on March 6, 1997 
(hereinafter "the LaBrie '824 application"), having essentially the identical specification, with the 
exception of corrected typographical errors and reformatting changes. Thus page and line 
numbers may not match as between the LaBrie '390 application and the LaBrie '824 application. 
My remarks herein will therefore be directed to the LaBrie '824 patent application, and March 6, 
1997, as the relevant date of filing. In broad overview, the LaBrie '824 specification pertains to 
certain nucleotide and amino acid sequences and their use in a number of applications, including 
gene and protein expression monitoring applications that are useful in connection with 
(a) developing drugs (e.g., for the treatment of cancer), and (b) monitoring the activity of drugs 
for purposes relating to evaluating their efficacy and toxicity. 

4. I understand that (a) the LaBrie '390 application contains claims that are 
directed to a substantially purified polypeptide having the sequence shown as SEQ ID NO: 1 
(hereinafter "the SEQ ID NO:1 polypeptide"), and (b) the Patent Examiner has rejected those 
claims on the grounds that the specification of the LaBrie '390 application does not disclose a 
substantial, specific and credible utility for the claimed SEQ ID NO: 1 polypeptide. I further 
understand that whether or not a patent specification discloses a substantial, specific and credible 
utility for its claimed subject matter is properly determined from the perspective of a person 
skilled in the art to which the specification pertains at the time the patent application was filed. 
In addition, I understand that a substantial, specific and credible utility under the patent laws 
must be a "real-world" utility. 

5. I have been asked (a) to consider with a view to reaching a conclusion (or 
conclusions) as to whether or not I agree with the Patent Examiner's position that the LaBrie '390 
application and its parent, the LaBrie '824 application, do not disclose a substantial, specific 
and credible "real-world" utility for the claimed SEQ ID NO:1 polypeptide, and (b) to state and 
explain the bases for any conclusions I reach. I have been informed that, in connection with my 
considerations, I should determine whether or not a person skilled in the art to which the LaBrie 
'824 application pertains on March 6, 1997, would have concluded that the LaBrie '824 
application disclosed, for the benefit of the public, a specific beneficial use of the SEQ ID NO: 1 
polypeptide in its then available and disclosed form. I have also been informed that, with respect 
to the "real-world" utility requirement, the Patent and Trademark Office instructs its Patent 
Examiners in Section 2107 of the Manual of Patent Examining Procedure, under the heading "I. 
'Real-World Value' Requirement": 

"Many research tools such as gas chromatography, screening assays, and 
nucleotide sequencing techniques have a clear, specific and unquestionable utility 
(e.g., they are useful in analyzing compounds). An assessment that focuses on 
whether an invention is useful only in a research setting thus does not address 
whether the specific invention is in fact 'useful' in a patent sense. Instead, Office 
personnel must distinguish between inventions that have a specifically identified 
utility and inventions whose specific utility requires further research to identify or 
reasonably confirm." 

6. I have considered the matters set forth in paragraph 5 of this Declaration 
and have concluded that, contrary to the position I understand the Patent Examiner has taken, the 
specification of the LaBrie '824 patent application disclosed to a person skilled in the art at the 
time of its filing a number of substantial, specific and credible real-world utilities for the claimed 
SEQ ID NO: 1 polypeptide. More specifically, persons skilled in the art on March 6, 1997 would 
have understood the LaBrie '824 application to disclose the use of the SEQ ID NO:1 polypeptide 
as a research tool in a number of gene and protein expression monitoring applications that were 
well-known at that time to be useful in connection with the development of drugs and the 
monitoring of the activity of such drugs. I explain the bases for reaching my conclusion in this 
regard in paragraphs 7-13 below. 

7. In reaching the conclusion stated in paragraph 6 of this Declaration, I 
considered (a) the specification of the LaBrie '824 application, and (b) a number of published 
articles and patent documents that evidence gene and protein expression monitoring techniques 
that were well-known before the March 6, 1997 filing date of the LaBrie '824 application. The 
published articles and patent documents I considered are: 

(a) Anderson, N.L., Esquer-Blasco, R., Hofmann, J.-P., Anderson, 
N.G., A Two-Dimensional Gel Database of Rat Liver Proteins Useful in Gene Regulation and 
Drug Effects Studies. Electrophoresis, 12, 907-930 (1991) (hereinafter "the Anderson 1991 
article") (copy annexed at Tab A); 

(b) Anderson, N.L., Esquer-Blasco, R., Hofmann, J.-P., Mehues, L., 
Raymackers, J., Steiner, S., Witzmann, F., Anderson, N.G., An Updated Two-Dimensional Gel 
Database of Rat Liver Proteins Useful in Gene Regulation and Drug Effect Studies. 
Electrophoresis, 16, 1977-1981 (1995) (hereinafter "the Anderson 1995 article") (copy annexed 
at Tab B); 

(c) Wilkins, M.R., Sanchez, J.-C, Gooley, A.A., Appel, R.D., 
Humphery-Smith, I., Hochstrasser, D.F., Williams, K.L., Progress with Proteome Projects: Why 
all Proteins Expressed by a Genome Should be Identified and How To Do It. Biotechnology and 
Genetic Engineering Reviews, 13, 19-50 (1995) (hereinafter "the Wilkins article") (copy annexed 
at Tab C); 

(d) Celis, J.E., Rasmussen, H.H., Leffers, H., Madsen, P., Honore, B., 
Gesser, B., Dejgaard, K., Vandekerckhove, J., Human Cellular Protein Patterns and their Link to 
Genome DNA Sequence Data: Usefulness of Two-Dimensional Gel Electrophoresis and 
Microsequencing. FASEB Journal, 5, 2200-2208 (1991) (hereinafter "the Celis article") (copy 
annexed at Tab D); 

(e) Franzen, B., Linder, S., Okuzawa, K., Kato, H., Auer, G., 
Nonenzymatic Extraction of Cells from Clinical Tumor Material for Analysis of Gene 
Expression by Two-Dimensional Polyacrylamide Gel Electrophoresis. Electrophoresis, 14, 
1045-1053 (1993) (hereinafter "the Franzen article") (copy annexed at Tab E); 

(f) Bjellqvist, B., Basse, B., Olsen, E., Celis, J.E., Reference Points 
for Comparisons of Two-Dimensional Maps of Proteins from Different Human Cell Types 
Defined in a pH Scale Where Isoelectric Points Correlate with Polypeptide Compositions. 
Electrophoresis, 15, 529-539 (1994) (hereinafter "the Bjellqvist article") (copy annexed at 
Tab F); 

(g) Large Scale Biology Company Info; LSB and LSP Information; 
from http://www.lsbc.com (2001) (copy annexed at Tab G). 

8. Many of the published articles I considered (i.e., at least items (a)-(f) 
identified in paragraph 7) relate to the development of protein two-dimensional gel 
electrophoretic techniques for use in gene expression monitoring applications in drug 
development and toxicology. As I will discuss below, a person skilled in the art who read the 
LaBrie '824 application on March 6, 1997 would have understood that application to disclose the 
SEQ ID NO:1 polypeptide to be useful for a number of gene and protein expression monitoring 
applications, e.g., in the use of two-dimensional polyacrylamide gel electrophoresis and western 
blot analysis of tissue samples in drug development and in toxicity testing. 

9. Turning more specifically to the LaBrie '824 specification, the SEQ ID 
NO:1 polypeptide is shown at pages 47-49 as one of four sequences under the heading "Sequence 
Listing." The LaBrie '824 specification specifically teaches that the "invention features a novel 
human tubby homolog (NHT) having the amino acid sequence shown in SEQ ID NO:1" (LaBrie 
'824 application at p. 2). It further teaches that (a) the identity of the SEQ ID NO:1 polypeptide 
was determined from a "neuronal cell cDNA library", (b) the SEQ ID NO:1 polypeptide is the 
novel tubby homolog referred to as "NHT" and is encoded by SEQ ID NO:2, and (c) northern 
analysis shows that "NHT is expressed in brain and neuronal tissues and lymph node tissue," and 
therefore "NHT appears to be involved in mammalian appetite and eating disorders, and to play a 
role in appetite and eating disorders, especially anorexia, cachexia and obesity" (LaBrie '824 
application at p. 10, line 30 to p. 31, line 15, p. 23, lines 11-13, and p. 2, lines 12-15). 

The LaBrie '824 application discusses a number of uses of the SEQ ID NO: 1 
polypeptide in addition to its use in gene expression monitoring applications. I have not fully 
evaluated these additional uses in connection with the preparation of this Declaration and do not 
express any views in this Declaration regarding whether or not the LaBrie '824 specification 
discloses these additional uses to be substantial, specific and credible real-world utilities of the 
SEQ ID NO: 1 polypeptide. Consequently, my discussion in this Declaration concerning the 
LaBrie '824 application focuses on the portions of the application that relate to the use of the 
SEQ ID NO:1 polypeptide in gene and protein expression monitoring applications. 

10. The LaBrie '824 application discloses that the polynucleotide sequences 
disclosed therein, including the polynucleotides encoding the SEQ ID NO:1 polypeptide, are 
useful as probes in chip based technologies. It further teaches that the chip based technologies 
can be used "for the detection and/or quantification of nucleic acid or protein" (LaBrie '824 
application at p. 21, lines 8-10). 

The LaBrie '824 application also discloses that the SEQ ID NO:1 polypeptide is 
useful in other protein expression detection technologies. The LaBrie '824 application states that 
"[a] variety of protocols for detecting and measuring the expression of NHT, using either 
polyclonal or monoclonal antibodies specific for the protein are known in the art. Examples 
include enzyme-linked immunosorbent assay (ELISA), radioimmunoassay (RIA), and 
fluorescence activated cell sorting (FACS)" (LaBrie '824 application at p. 21, lines 19-22). 
Furthermore, the LaBrie '824 application discloses that "[a] variety of protocols including 
ELISA, RIA, and FACS for measuring NHT are known in the art and provide a basis for 
diagnosing altered or abnormal levels of NHT expression. Normal or standard values for NHT 
expression are established by combining body fluids or cell extracts taken from normal 
mammalian subjects, preferably human, with antibody to NHT under conditions suitable for 
complex formation" (LaBrie '824 application at p. 32, lines 24-28). 

In addition, at the time of filing the LaBrie '824 application, it was well known in 
the art that "gene" and protein expression analyses also included two-dimensional 
polyacrylamide gel electrophoresis (2-D PAGE) technologies, which were developed during the 
1980s, and as exemplified by the Anderson 1991 and 1995 articles (Tab A and Tab B). The 
Anderson 1991 article teaches that a 2-D PAGE map has been used to connect and compare 
hundreds of 2-D gels of rat liver samples from a variety of studies including regulation of protein 
expression by various drugs and toxic agents (Tab A at p. 907). The Anderson 1991 article 
teaches an empirically-determined standard curve fitted to a series of identified proteins based 
upon amino acid chain length (Tab A at p. 911) and how that standard curve can be used in 
protein expression analysis. The Anderson 1991 article teaches that "there is a long-term need 
for a comprehensive database of liver proteins" (Tab A at p. 912). 

The Wilkins article is one of a number of documents that were published prior to 
the March 6, 1997 filing date of the LaBrie '824 application and that describe the use of the 2-D 
PAGE technology in a wide range of gene and protein expression monitoring applications, 
including monitoring and analyzing protein expression patterns in human cancer, human serum 
plasma proteins, and in rodent liver following exposure to toxins. In view of the LaBrie '824 
application, the Wilkins article, and other related pre-March 6, 1997 publications, persons skilled 
in the art on March 6, 1997 clearly would have understood the LaBrie '824 application to 
disclose the SEQ ID NO:1 polypeptide to be useful in 2-D PAGE analyses for the development 
of new drugs and monitoring the activities of drugs for such purposes as evaluating their efficacy 
and toxicity, as explained more fully in paragraph 12 below. 

With specific reference to toxicity evaluations, those of skill in the art who were 
working on drug development in March 1997 (and for many years prior to March 1997) without 
any doubt appreciated that the toxicity (or lack of toxicity) of any proposed drug they were 
working on was one of the most important criteria to be considered and evaluated in connection 
with the development of the drug. They would have understood at that time that good drugs are 
not only potent, they are specific. This means that they have strong effects on a specific 
biological target and minimal effects on all other biological targets. Ascertaining that a 
candidate drug affects its intended target, and identification of undesirable secondary effects (i.e., 
toxic side effects), had been for many years among the main challenges in developing new drugs. 
The ability to determine which genes are positively affected by a given drug, coupled with the 
ability to quickly and at the earliest time possible in the drug development process identify drugs 
that are likely to be toxic because of their undesirable secondary effects, has enormous value in 
improving the efficiency of the drug discovery process, and is an important and essential part of 
the development of any new drug. In fact, the desire to identify and understand toxicological 
effects using the experimental assays described above led Dr. Leigh Anderson to found the Large 
Scale Biology Corporation in 1985, in order to pursue commercial development of the 2-D 
electrophoretic protein mapping technology he had developed. In addition, the company focused 
on toxicological effects on the proteome as clearly demonstrated by its goals and by its senior 
management credentials described in company documents (see Tab G at pp. 1, 3, and 5). 

Accordingly, the teachings in the LaBrie '824 application, in particular regarding 
use of SEQ ID NO: 1 in differential gene and protein expression analysis (2-D PAGE maps) and 
in the development and the monitoring of the activities of drugs, clearly include toxicity studies, 
and persons skilled in the art who read the LaBrie '824 application on March 6, 1997 would have 
understood that to be so. 

11. As previously discussed (supra, paragraphs 7 and 8), my experience with 
protein analysis methods in the mid-1980s and the several publications annexed to this 
Declaration at Tabs A through F evidence information that was available to the public regarding 
two-dimensional polyacrylamide gel electrophoresis technology and its uses in drug discovery 
and toxicology testing before the March 6, 1997 filing date of the LaBrie '824 application. In 
particular the Celis article stated that "protein databases are expected to foster a variety of 
biological information.... — among others, drug development and testing" (See Tab D, p. 
2200, second column). The Franzen article shows that 2-D PAGE maps were used to identify 
proteins in clinical tumor material (See Tab E). The LaBrie '824 application clearly discloses 
that expression of NHT is associated with brain, neuronal and lymph node tissues (LaBrie '824 
application at p. 11, lines 13-15). The Bjellqvist article showed that a protein may be identified 
accurately by its positional co-ordinates, namely molecular mass and isoelectric point (See Tab 
F). The LaBrie '824 application clearly disclosed SEQ ID NO: 1 from which it would have been 
routine for one of skill in the art to predict both the molecular mass and the isoelectric point 
using algorithms well known in the art at the time of filing. 
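
By way of illustration only, and not as part of the LaBrie '824 application or the annexed 
articles, the kind of routine prediction referred to in paragraph 11 can be sketched as follows. 
The sketch assumes average residue masses for the twenty standard amino acids and one 
illustrative set of side-chain and terminal pKa values (published pKa scales differ, so a 
predicted isoelectric point shifts slightly with the scale chosen). The function names and the 
example peptide are hypothetical, and the peptide is not SEQ ID NO:1. 

```python
# Illustrative sketch only: predict molecular mass and isoelectric point (pI)
# from an amino acid sequence. Function names and pKa values are assumptions.

# Average residue masses in daltons (free amino acid minus one water).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.0153  # added once per chain for the free termini

# Assumed pKa values; other published scales give slightly different results.
POSITIVE_PKA = {"K": 10.8, "R": 12.5, "H": 6.5}
NEGATIVE_PKA = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}
N_TERM_PKA = 8.6
C_TERM_PKA = 3.6


def molecular_mass(sequence: str) -> float:
    """Sum the average residue masses and add one water molecule."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER_MASS


def net_charge(sequence: str, ph: float) -> float:
    """Net charge at a given pH from the Henderson-Hasselbalch relation."""
    charge = 1.0 / (1.0 + 10 ** (ph - N_TERM_PKA))    # N-terminal amine
    charge -= 1.0 / (1.0 + 10 ** (C_TERM_PKA - ph))   # C-terminal carboxyl
    for aa in sequence:
        if aa in POSITIVE_PKA:
            charge += 1.0 / (1.0 + 10 ** (ph - POSITIVE_PKA[aa]))
        elif aa in NEGATIVE_PKA:
            charge -= 1.0 / (1.0 + 10 ** (NEGATIVE_PKA[aa] - ph))
    return charge


def isoelectric_point(sequence: str) -> float:
    """Bisect on pH (0 to 14) for the point where the net charge is zero."""
    low, high = 0.0, 14.0
    for _ in range(60):
        mid = (low + high) / 2.0
        if net_charge(sequence, mid) > 0:
            low = mid    # still positively charged: pI lies at a higher pH
        else:
            high = mid
    return (low + high) / 2.0


if __name__ == "__main__":
    peptide = "MKDEHRYCAG"  # hypothetical peptide, not SEQ ID NO:1
    print(f"mass = {molecular_mass(peptide):.1f} Da")
    print(f"pI   = {isoelectric_point(peptide):.2f}")
```

A predicted (mass, pI) pair of this kind corresponds to the positional co-ordinates that the 
Bjellqvist article uses to locate a protein on a two-dimensional map. 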

12. A person skilled in the art on March 6, 1997, who read the LaBrie '824 
application, would understand that application to disclose the SEQ ID NO:1 polypeptide to be 
highly useful in analysis of differential expression of proteins. For example, the specification of 
the LaBrie '824 application would have led a person skilled in the art in March 1997 who was 
using protein expression monitoring in connection with working on developing new drugs for the 
treatment of appetite and eating disorders, especially anorexia, cachexia and obesity, to 
conclude that a 2-D PAGE map that used the substantially purified SEQ ID NO:1 polypeptide 
would be a highly useful tool and to request specifically that any 2-D PAGE map that was being 
used for such purposes utilize the SEQ ID NO:1 polypeptide sequence. Expressed proteins are 
useful for 2-D PAGE analysis in toxicology expression studies for a variety of reasons, 
particularly for purposes relating to providing controls for the 2-D PAGE analysis, and for 
identifying sequence or post-translational variants of the expressed sequences in response to 
exogenous compounds. Persons skilled in the art would appreciate that a 2-D PAGE map that 
utilized the SEQ ID NO: 1 polypeptide sequence would be a more useful tool than a 2-D PAGE 
map that did not utilize this protein sequence in connection with conducting protein expression 
monitoring studies on proposed (or actual) drugs for treating appetite and eating disorders, 
especially anorexia, cachexia and obesity for such purposes as evaluating their efficacy and 
toxicity. 

I discuss in more detail in items (a)-(b) below a number of reasons why a person 
skilled in the art, who read the LaBrie '824 specification in March 1997, would have concluded 
based on that specification and the state of the art at that time, that the SEQ ID NO:1 polypeptide 
would be a highly useful tool for analysis of a 2-D PAGE map for evaluating the efficacy and 
toxicity of proposed drugs for appetite and eating disorders, especially anorexia, cachexia and 
obesity by means of 2-D PAGE maps, as well as for other evaluations: 

(a) The LaBrie '824 specification contains a number of teachings that 
would lead persons skilled in the art on March 6, 1997 to conclude that a 2-D PAGE map that 
utilized the substantially purified SEQ ID NO: 1 polypeptide would be a more useful tool for 
gene expression monitoring applications relating to drugs for treating appetite and eating 
disorders, especially anorexia, cachexia and obesity than a 2-D PAGE map that did not use the 
SEQ ID NO:1 polypeptide sequence. Among other things, the LaBrie '824 specification teaches 
that (i) the identity of the SEQ ID NO:1 polypeptide was determined from a "neuronal cell line 
cDNA library," (ii) the SEQ ID NO: 1 polypeptide is the novel Tubby homolog referred to as 
NHT, and (iii) NHT is expressed in various libraries derived from brain and neuronal tissue (fetal 
and infant brain) and lymph node tissues and, therefore, "NHT appears to be involved in maturity 
onset diabetes, insulin resistance, progressive retinal degeneration and hearing loss, and to play a 
role in appetite and eating disorders, especially anorexia, cachexia and obesity" (LaBrie '824 
application at p. 2; see paragraph 9, supra). The substantially purified polypeptide could 
therefore be used as a control to more accurately gauge the expression of NHT in the sample and 
consequently more accurately gauge the effect of a toxicant on expression of the gene. 

(b) Persons skilled in the art on March 6, 1997 would have appreciated (i) 
that the protein expression monitoring results obtained using a 2-D PAGE map that utilized a 
SEQ ID NO:1 polypeptide would vary, depending on the particular drug being evaluated, and (ii) 
that such varying results would occur both with respect to the results obtained from the SEQ ID 
NO:1 polypeptide and from the 2-D PAGE map as a whole (including all its other individual 
proteins). These kinds of varying results, depending on the identity of the drug being tested, in 
no way detract from my conclusion that persons skilled in the art on March 6, 1997, having read 
the LaBrie '824 specification, would specifically request that any 2-D PAGE map that was being 
used for conducting protein expression monitoring studies on drugs for treating appetite and 
eating disorders, especially anorexia, cachexia and obesity (e.g., a toxicology study or any 
efficacy study of the type that typically takes place in connection with the development of a 
drug) utilize the SEQ ID NO:1 polypeptide sequence. Persons skilled in the art on March 6, 
1997 would have wanted their 2-D PAGE map to utilize the SEQ ID NO:1 polypeptide sequence 
because a 2-D PAGE map that utilized protein sequence information of the polypeptide (as 
compared to one that did not) would provide more useful results in the kind of gene expression 
monitoring studies using 2-D PAGE maps that persons skilled in the art have been doing since 
well prior to March 6, 1997. 

The foregoing is not intended to be an all-inclusive explanation of all my reasons 
for reaching the conclusions stated in this paragraph 12, and in paragraph 6, supra. In my view, 
however, it provides more than sufficient reasons to justify my conclusions stated in paragraph 6 
of this Declaration regarding the LaBrie '824 application disclosing to persons skilled in the art 
at the time of its filing substantial, specific and credible real-world utilities for the SEQ ID NO: 1 
polypeptide. 

13. Also pertinent to my considerations underlying this Declaration is the fact 
that the LaBrie '824 disclosure regarding the uses of the SEQ ID NO: 1 polypeptide for protein 
expression monitoring applications is not limited to the use of that protein in 2-D PAGE maps. 
For one thing, the LaBrie '824 disclosure regarding the technique used in gene and protein 
expression monitoring applications is broad (LaBrie '824 application at, e.g., p. 21, lines 6-10 
and p. 32, line 24 to p. 33, line 2). 

In addition, the LaBrie '824 specification repeatedly teaches that the protein 
described therein (including the SEQ ID NO: 1 polypeptide) may desirably be used in any of a 
number of long established "standard" techniques, such as ELISA or western blot analysis, for 
conducting protein expression monitoring studies. See, e.g.: 

(a) LaBrie '824 application at p. 21, lines 19-22 ("A variety of protocols 
for detecting and measuring the expression of NHT, using either polyclonal or monoclonal 
antibodies specific for the protein are known in the art. Examples include enzyme-linked 
immunosorbent assay (ELISA), radioimmunoassay (RIA), and fluorescence activated cell sorting 
(FACS)"); 

(b) LaBrie '824 application at p. 32, line 24 to p. 33, line 2 ("A variety of 
protocols including ELISA, RIA, and FACS for measuring NHT are known in the art and 
provide a basis for diagnosing altered or abnormal levels of NHT expression. Normal or 
standard values for NHT expression are established by combining body fluids or cell extracts 
taken from normal mammalian subjects, preferably human, with antibody to NHT under 
conditions suitable for complex formation. The amount of standard complex formation may be 
quantified by various methods, but preferably by photometric, means. Quantities of NHT 
expressed in subject, control and disease, samples from biopsied tissues are compared with the 
standard values. Deviation between standard and subject values establishes the parameters for 
diagnosing disease"). 

Thus a person skilled in the art on March 6, 1997, who read the LaBrie '824 
specification, would have routinely and readily appreciated that the SEQ ID NO:1 polypeptide 
disclosed therein would be useful to conduct gene expression monitoring analyses using 2-D 
PAGE mapping or western blot analysis or any of the other traditional membrane-based protein 
expression monitoring techniques that were known and in common use many years prior to the 
filing of the LaBrie '824 application. For example, a person skilled in the art in March 1997 
would have routinely and readily appreciated that the SEQ ID NO: 1 polypeptide would be a 
useful tool in conducting protein expression analyses, using the 2-D PAGE mapping or western 
analysis techniques, in furtherance of (a) the development of drugs for the treatment of appetite 
and eating disorders, especially anorexia, cachexia and obesity, and (b) analyses of the efficacy 
and toxicity of such drugs. 

14. I declare further that all statements made herein of my own knowledge are 
true and that all statements made herein on information and belief are believed to be true; and 
further, that these statements were made with the knowledge that willful false statements and the 
like so made are punishable by fine or imprisonment, or both, and that willful false statements 
may jeopardize the validity of this application and any patent issuing thereon. 




L. Michael Furness, B.Sc. 



Signed at Exning, United Kingdom 
this 12th day of December, 2002 

A two-dimensional gel database of rat liver proteins 
useful in gene regulation and drug effects studies 

A standard two-dimensional (2-D) protein map of Fischer 344 rat liver 
(F344MST3) is presented, with a tabular listing of more than 1200 protein species. 
Sodium dodecyl sulfate (SDS) molecular mass and isoelectric point have been 
established, based on positions of numerous internal standards. This map has been 
used to connect and compare hundreds of 2-D gels of rat liver samples from a 
variety of studies, and forms the nucleus of an expanding database describing rat 
liver proteins and their regulation by various drugs and toxic agents. An example 
of such a study, involving regulation of cholesterol synthesis by 
cholesterol-lowering drugs and a high-cholesterol diet, is presented. Since the map 
has been obtained with a widely used and highly reproducible 2-D gel system (the 
Iso-Dalt* system), it can be directly related to an expanding body of work in other 
laboratories. 

1 Introduction 

High-resolution two-dimensional electrophoresis of proteins, introduced in 1975 by 
O'Farrell and others [1-4], has been used over the ensuing 16 years to examine a wide 
variety of biological systems, the results appearing in more than 5000 published 
papers. With the advent of computerized systems for analyzing two-dimensional (2-D) 
gel images and constructing spot databases, it is also possible to plan and assemble 
integrated bodies of information describing the appearance and regulation of 
thousands of protein gene products [5, 6]. Creating such databases involves amassing 
and organizing quantitative data from thousands of 2-D gels, and requires a 
substantial commitment in technology and resources. 

Given the long-term effort required to develop a protein database, the choice of a 
biological system takes on considerable importance. While in vitro systems are ideal 
for answering many experimental questions, especially in cancer research and 
genetics, our experience with cell cultures and tissue samples suggests that some in 
vivo approaches could have major advantages. In particular, we have noticed that 
liver tissue samples from rats and mice appear to show greater quantitative 
reproducibility (in terms of individual protein expression) than replicate cell 
cultures. This is perhaps a natural result of the homeostasis maintained in a 
complete animal vs. the well-known variability of cell cultures, the latter due 
principally to differences in reagents (e.g., fetal bovine serum), conditions (e.g., 
pH) and genetic "evolution" of cell lines while in culture. It is also more difficult 
to generate adequate amounts of protein from cell culture systems (particularly with 
attached cells), forcing the investigator to resort to radioisotope-based or 
silver-based stain-detection methods. While these methods are more sensitive 
(sometimes much more sensitive) than the Coomassie Brilliant Blue (CBB) stain 
typically used for protein detection in "large" protein samples, they are generally 
more variable, more labor-intensive and, in the case of radiographic methods, may 
generate highly "noisy" images, due to the properties of the films used. By contrast, 
large protein samples can easily be prepared from liver using urea/Nonidet P-40 
(NP-40) solubilization and stained with CBB, which has the advantage of being easily 
reproducible [8]. Finally, there remains the question of the "truthfulness" of many 
in vitro systems as compared to their in vivo analogs; how great are the changes 
caused by the introduction into a culture and the associated shift to strong 
selection for growth, and how do these affect experimental outcomes? Hence the 
apparent advantages of in vitro systems, in terms of experimental manipulation, may 
be counterbalanced by other factors relating to 2-D data quality. 

There is a second important class of reasons for exploring 
the use of an in vivo biological system such as the liver. His- 
torically, there have been two broad approaches to the me- 
chanistic dissection of biochemical processes in intact cel- 
lular systems: genetics (a search for informative mutants) 
and the use of chemical agents (drugs and chemical toxins). 
Both approaches help us to understand complex systems 
by disrupting some specific functional element and show- 
ing us the result. With the development of techniques for 
genetic manipulation and cloning, the genetic approach 
can be effectively applied either in vitro or in vivo, although 
the in vitro route is usually quicker. The chemical approach 
can also be applied to either sort of biological system; here, 
however, the bulk of consistently acquired information is 
in experimental animals (rats and mice). While most biologists
know a short list of compounds having specific, experimentally
useful effects (inhibitors of protein synthesis,
ionophores, polymerase inhibitors, channel blockers, nucleotide
analogs, and compounds affecting polymerization
of cytoskeletal proteins), there is a much larger number of
interesting chemically-induced effects, most of them characterized
by toxicologists and pharmacologists in rodent
systems. Just as a thorough genetic analysis would involve
saturating a genome with mutations, it is possible to ima- 
gine a saturating number of drugs, the analysis of whose ac- 
tions would reveal the complete biochemistry of the cell. 
While organized drug discovery efforts usually target spe- 
cific desired effects, the nature of the process, with its de- 
pendence on screening large numbers of compounds, ne- 
cessarily produces many unanticipated effects. It is there- 
fore reasonable to suppose that the required broad range of 
compounds necessary to achieve "biochemical saturation" 
may be forthcoming; in fact, it may already exist among the 
hundreds of thousands of compounds that failed to qualify 
as drugs. 

Among organs, the liver is an obvious choice for the study 
of chemical effects because of its well-known plasticity and 
responsiveness. The brain appears to be quite plastic (e.g.,
[7]), but it is a complicated mixture of cell types requiring
skillful dissection for most experiments. The kidney, while 
quite responsive, also presents a potentially confounding 
mixture of cell types. The liver, by contrast, is made up of 
one predominant cell type which is easy to solubilize: the 
hepatocyte, representing more than 95% of its mass. Most 
importantly, the liver performs many homeostatic func- 
tions that require rapid modulation of gene expression. It 
appears that most chemical agents tested affect gene ex- 
pression in the liver at some dosage (N. Leigh Anderson, 
unpublished observations), an interesting contrast to our 
earlier work with lymphocytes, for example, which seem to 
be much less responsive. Such results conform to the expec- 
tation that cells with a homeostatic, physiological role 
should be more plastic than cells differentiated for a pur- 
pose dependent on the action of a limited number of spe- 
cific genes. 

The liver also allows the parallels between in vitro and in vivo systems to
be examined in detail. Significant progress has been made in the
development of mouse, rat and human hepatocyte culture systems, as well as
in precision-cut tissue slices. Using such an array of techniques it is
possible to assemble a matrix of mammalian systems including mouse and rat
in vivo on one level and mouse, rat and human in vitro on a second level,
and to compare effects between species and between systems. This approach
allows us to draw informed conclusions regarding the biochemical
"universality" of biological responses among the mammals, and to offer some
insight into the validity of in vitro approaches for toxicological
screening. We believe this data will be necessary if in vitro alternatives
are to achieve wide usage in government-mandated safety testing of drugs,
consumer products and industrial and agricultural chemicals.

A number of interesting studies have been published using 2-D mapping to
examine effects in the rodent liver. A number of investigators have made
use of the technique to screen for existing genetic variants [8-11] or
induced mutations [12-14], mainly in the mouse. This work builds on the
wealth of genetic information available on the mouse and its established
position as a mammalian mutation-detection system. While some studies of
chemical effects have been undertaken in the mouse [15-17], most have used
the rat [18-23]. The examination of the cytochrome P-450 system, in
particular, has been carried out almost exclusively on the rat [24, 25].

These considerations lead us to conclude that rodent liver offers the best
opportunity to systematically examine an array of gene regulation systems,
and ultimately to build a predictive model of large-scale mammalian gene
control. The basic underlying foundation of such a project is a reliable,
reproducible master 2-D pattern of liver, to which ongoing experimental
results can be referred. In this paper, we report such a master pattern for
the acidic and neutral proteins of rat liver (pattern F344MST3). In future,
this master will be supplemented by maps of basic proteins, and analogous
maps of mouse and human liver.



2 Materials and methods 
2.1 Sample preparation 

Liver is an ideal sample material for most biochemical stud- 
ies, including 2-D analysis. A sample is taken of approxima- 
tely 0.5 g of tissue from the apical end of the left lobe of the 
liver. Solubilization is effected as rapidly as practical; a
delay of 5-15 min appears to cause no major alteration in
liver protein composition if the liver pieces are kept cold 
(e.g., on ice) in the interim. In the solubilization process, 
the liver sample is weighed, placed in a glass homogenizer 
(e.g., 15 mL Wheaton); 8 volumes of solubilizing solution* are added
(i.e., 4 mL per 0.5 g tissue) and the mixture is homogenized using first
the loose- and then the tight-fitting glass pestle. This takes
approximately 5 strokes with each pestle and is carried out at room
temperature because urea would crystallize out in the cold. Once the liver
sample is thoroughly homogenized in the solubilizer, it is assumed that all
the proteins are denatured (by the chaotropic effect of the urea and NP-40
detergent) and the enzymes inactivated by the high pH (~9.5). Therefore
these samples may be kept at room temperature until they can be centrifuged
or frozen as a group (within several hours of preparation). The samples are
centrifuged for 6 × 10^6 g·min (e.g., 500 000 × g for 12 min using a
Beckman TL-100 centrifuge). The centrifuge rotor is maintained at just
below room temperature (e.g., 15-20 °C), but not too cold, so as to prevent
the precipitation of urea. The centrifuge of choice is a Beckman TL-100
because of the sample tube sizes available, but any ultracentrifuge
accepting smallish tubes will suffice. When an appropriate centrifuge is
not available near the site of sample preparation, samples can be frozen at
-80 °C and thawed prior to centrifugation and collection of supernatants.
Each supernatant is carefully removed following centrifugation and
aliquoted into at least 4 clean tubes for storage. This is done by
transferring all the supernatant to one clean tube, mixing this gently (to
assure homogeneous composition) and then dividing it into 4 aliquots. The
aliquots are frozen immediately at -80 °C. These multiple aliquots can
provide insurance against a failed run or a freezer breakdown.

* The solubilizing solution is composed of 2% NP-40 (Sigma), 9 M urea
(analytical grade, e.g., BDH or Bio-Rad), 0.5% dithiothreitol (DTT;
Sigma) and 2% carrier ampholytes (pH 9-11 LKB; these come as a 20%
stock solution, so 2% final concentration is achieved by making the final
solution 10% 9-11 Ampholine by volume). A large batch of solubilizer
(several hundred mL) is made and stored frozen at -80 °C in aliquots
sufficient to provide enough for one day's estimated sample preparation
requirement. The solution is never allowed to become warmer
than room temperature at any stage during preparation or thawing for
use, since heating of concentrated urea solutions can produce contaminants
that covalently modify proteins, producing artifactual charge
shifts. Once thawed, any unused solubilizer is discarded.
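
The volume and centrifugation figures quoted in this section can be checked with a few lines of arithmetic. The following Python sketch is illustrative only: the function names are hypothetical, and it assumes that 1 g of tissue occupies roughly 1 mL, so that "8 volumes" is read as 8 mL of solubilizer per gram of liver.

```python
# Illustrative arithmetic for the liver sample protocol.
# Assumption (not stated in the text): 1 g of tissue is treated as ~1 mL.

def solubilizer_volume_ml(tissue_g, volumes=8.0):
    """Volume of solubilizing solution for a given tissue weight."""
    return volumes * tissue_g


def spin_minutes(rcf_g, target_g_min=6e6):
    """Minutes of centrifugation at rcf_g needed to reach the target g·min."""
    return target_g_min / rcf_g


if __name__ == "__main__":
    print(solubilizer_volume_ml(0.5))  # mL for a 0.5 g sample
    print(spin_minutes(500_000))       # minutes at 500 000 x g
```

With these assumptions a 0.5 g sample calls for 4 mL of solubilizer, and 500 000 × g must be held for 12 min to reach 6 × 10^6 g·min, which agrees with the figures given in the protocol.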

2.2 Two-dimensional electrophoresis

Sample proteins are resolved by 2-D electrophoresis using the 20 × 25 cm
Iso-Dalt* 2-D gel system ([26-29]; produced by LSB and by Hoefer Scientific
Instruments, San Francisco) operating with 20 gels per batch. All
first-dimensional isoelectric focusing (IEF) gels are prepared using the
same single standardized batch of carrier ampholytes (BDH 4-8A in the
present case, selected by LSB's batch-testing program for rat and mouse
database work**). A 10 µL sample of solubilized liver protein is applied to
each gel, and the gels are run for 33 000 to 34 500 volt-hours using a
progressively increasing voltage protocol implemented by a programmable
high-voltage power supply. An Angelique* computer-controlled
gradient-casting system (produced by LSB) is used to prepare
second-dimensional sodium dodecyl sulfate (SDS) polyacrylamide gradient
slab gels in which the top 5% of the gel is 11%T acrylamide, and the lower
95% of the gel varies linearly from 11% to 18%T.

This system has recently been modified so as to employ a commercially
available 30.8%T acrylamide/N,N'-methylenebisacrylamide prepared solution
(thus avoiding the handling of the solid acrylamide monomer) and three
additional stock solutions: buffer (made from Sigma pre-set Tris),
persulfate and N,N,N',N'-tetramethylethylenediamine (TEMED). Each gel is
identified by a computer-printed filter paper label polymerized into the
lower left corner of the gel. First-dimensional IEF tube gels are loaded
directly (as extruded) onto the slab gels without equilibration, and held
in place by polyester fabric wedges (Wedgies**, produced by LSB) to avoid
the use of hot agarose. Second-dimensional slab gels are run overnight, in
groups of 20, in cooled DALT tanks (10 °C) with buffer circulation. All run
parameters, reagent source and lot information, and notations of deviation
from expected results are entered by the technician responsible on a
detailed, multi-page record of the experiment.

** This material (succeeding certified batches of which are available from
Hoefer Scientific Instruments) has the most linear pH gradient produced
by any ampholyte tested except for the Pharmacia wide range
(which has an unacceptable tendency to bind high-molecular weight
acidic proteins, causing them to streak).

2.3 Staining

Following SDS-electrophoresis, slab gels are stained for protein using a
colloidal Coomassie Blue G-250 procedure in covered plastic boxes, with 10
gels (totalling approximately 1 L of gel) per box. This procedure (based on
the work of Neuhoff [30, 31]) involves fixation in 1.5 L of 50% ethanol and
2% phosphoric acid for 2 h, three 30 min washes, each in 2 L of cold tap
water, and transfer to 1.5 L of 34% methanol, 17% ammonium sulfate and 2%
phosphoric acid for 1 h, followed by the addition of a gram of powdered
Coomassie Blue G-250 stain. Staining requires approximately 4 days to reach
equilibrium intensity, whereupon gels are transferred to cool tap water and
their surfaces rinsed to remove any particulate stain prior to scanning.
Gels may be kept for several months in water with added sodium azide. The
water washes remove ethanol that would dissolve the stain (and render the
system noncolloidal, with high backgrounds). The concentrated ammonium
sulfate and methanol solution is diluted by equilibration with the water
volume of the gels to automatically achieve the correct final
concentrations for colloidal staining. Practical advantages of this
staining approach can be summarized as follows: (i) the low, flat
background makes computer evaluation of small spots (max OD < 0.02)
possible, especially when using laser densitometry; (ii) up to 1500 spots
can be reliably detected on many gels (e.g., rat liver) at loadings low
enough to preserve excellent resolution; and (iii) reproducibility appears
to be very good: at least several hundred spots have coefficients of
reproducibility less than 15%. This value is at least as good as previous
CBB methods, and significantly better than many silver stain systems.
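
The dilution-by-equilibration step can be illustrated with a simple mass balance. The sketch below (Python; the function name is hypothetical) assumes that the roughly 1 L of gel in each box behaves as 1 L of water that contains none of the three solutes after the water washes, and that mixing is complete. The results are estimates under those assumptions, not measured values.

```python
# Mass-balance sketch of the staining solution after equilibration with the gels.
# Assumptions: the gels contribute ~1 L of solute-free water; mixing is complete.

def equilibrated_percent(stock_percent, stock_volume_l, gel_volume_l):
    """Final concentration once the stock has equilibrated with the gel water."""
    return stock_percent * stock_volume_l / (stock_volume_l + gel_volume_l)


if __name__ == "__main__":
    stock = {"methanol": 34.0, "ammonium sulfate": 17.0, "phosphoric acid": 2.0}
    for solute, percent in stock.items():
        final = equilibrated_percent(percent, stock_volume_l=1.5, gel_volume_l=1.0)
        print(f"{solute}: {final:.1f}%")
```

Under these assumptions the 34% methanol and 17% ammonium sulfate stock equilibrates to about 20% and 10%, respectively.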

2.4 Positional standardization 

The carbamylated rabbit muscle creatine phosphokinase 
(CPK) standards [32] are purchased from Pharmacia and 
BDH. Amino acid compositions, and numbers of residues 
present in proteins used for internal standardization, are 
taken from the Protein Identification Resource (PIR)
sequence database [33].



2.5 Computer analysis 

Stained slab gels are digitized in red light at 134 micron re- 
solution, using either a Molecular Dynamics laser scanner 
(with pixel sampling) or an Eikonix 78/99 CCD scanner. 
Raw digitized gel images are archived on high-density DAT 
tape (or equivalent storage media) and a greyscale video- 
print prepared from the raw digital image as hard-copy 
backup of the gel image. Gels are processed using the Kep- 
ler* software system (produced by LSB), a commercially 
available workstation-based software package built on
some of the principles of the earlier TYCHO system [34-41].
Procedure PROC008 is used to yield a spotlist giving
position, shape and density information for each detected 
spot. This procedure makes use of digital filtering, mathe- 
matical morphology techniques and digital masking to re- 
move the background, and uses full 2-D least-squares opti- 
mization to refine the parameters of a 2-D Gaussian shape 
for each spot. Processing parameters and file locations are 
stored in a relational database, while various log files detail- 
ing operation of the automatic analysis software are ar- 
chived with the reduced data. The computed resolution and 
level of Gaussian convergence of each gel are inspected 
and archived for quality control purposes. 
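
To make the spot model concrete, the following Python sketch writes out an axis-aligned 2-D Gaussian, its integrated density, and the sum-of-squares misfit that a least-squares refinement would minimize. It is not the PROC008 procedure, whose internals are not given here: the function names, the parameter layout and the synthetic test image are all assumptions made for illustration, and no optimizer or background removal is included.

```python
import math

# Sketch of a 2-D Gaussian spot model (not the Kepler PROC008 implementation).
# params = (amplitude, x0, y0, sigma_x, sigma_y); all names are hypothetical.


def gaussian_spot(x, y, amplitude, x0, y0, sigma_x, sigma_y):
    """Optical density predicted at pixel (x, y) by an axis-aligned 2-D Gaussian."""
    dx = (x - x0) / sigma_x
    dy = (y - y0) / sigma_y
    return amplitude * math.exp(-0.5 * (dx * dx + dy * dy))


def analytic_volume(amplitude, sigma_x, sigma_y):
    """Integrated density under the Gaussian: 2 * pi * A * sigma_x * sigma_y."""
    return 2.0 * math.pi * amplitude * sigma_x * sigma_y


def residual_sum_of_squares(pixels, params):
    """Least-squares objective: squared misfit between observed pixels and model.

    pixels maps (x, y) -> observed optical density.
    """
    return sum((od - gaussian_spot(x, y, *params)) ** 2
               for (x, y), od in pixels.items())


def synthetic_spot(params, half_width=30):
    """Noise-free image of one spot on a square pixel grid (test data only)."""
    _, x0, y0, _, _ = params
    return {(x, y): gaussian_spot(x, y, *params)
            for x in range(x0 - half_width, x0 + half_width + 1)
            for y in range(y0 - half_width, y0 + half_width + 1)}


if __name__ == "__main__":
    true_params = (0.5, 100, 200, 3.0, 4.0)
    image = synthetic_spot(true_params)
    print(sum(image.values()), analytic_volume(0.5, 3.0, 4.0))
    print(residual_sum_of_squares(image, true_params))
```

A refinement step would adjust the five parameters of each spot to reduce the misfit; the integrated density then follows directly from the fitted amplitude and widths.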

Experiment packages are constructed using the Kepler experiment definition
database to assemble groups of 2-D patterns corresponding to the
experimental groups (e.g., treated and control animals). Each 2-D pattern
is matched to the appropriate "master" 2-D pattern (pattern F344MST3 in the
case of Fischer 344 rat liver), thereby providing linkage to the existing
rodent protein 2-D databases. The software allows experiments containing
hundreds of gels to be constructed and analyzed as a unit, with up to 100
gels displayed on the screen at one time for comparative purposes and
multiple pages to accommodate experiments of > 1000 gels. For each
treatment, proteins showing significant quantitative differences vs.
appropriate controls are selected using group-wise statistical parameters
(e.g., Student's t-test, Kepler* procedure STUDENT). Proteins satisfying
various quantitative criteria (such as P < 0.001 difference from
appropriate controls) are represented as highlighted spots onscreen or on
computer-plotted protein maps and stored as spot populations (i.e., logical
vectors) in a liver protein database. Quantitative data (spot parameters,
statistical or other computed values) are stored as real-valued vectors in
the database. Analysis of coregulation is performed using a Pearson
product-moment correlation (Kepler procedure CORREL) to determine whether
groups of proteins are coordinately regulated by any of the treatments.
Such groups can be presented graphically on a protein map, and reported
together with the statistical criteria used to assess the level of
coregulation. Multivariate statistical analysis (e.g., principal components
analysis) is performed on data exported to SAS (SAS Institute).
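
The two statistics named above can be written out briefly. The Python sketch below is a plain textbook implementation, not the Kepler STUDENT or CORREL procedures, whose internals are not described here. It assumes the pooled-variance (equal variance) form of the two-sample t-test and returns only the t statistic and degrees of freedom; converting t to a P value would need a t-distribution table or a statistics library.

```python
import math

# Textbook forms of the two statistics; function names are hypothetical.


def mean(values):
    return sum(values) / len(values)


def student_t(group_a, group_b):
    """Two-sample Student's t statistic with pooled variance.

    Returns (t, degrees_of_freedom).
    """
    n_a, n_b = len(group_a), len(group_b)
    m_a, m_b = mean(group_a), mean(group_b)
    ss_a = sum((v - m_a) ** 2 for v in group_a)
    ss_b = sum((v - m_b) ** 2 for v in group_b)
    dof = n_a + n_b - 2
    pooled_variance = (ss_a + ss_b) / dof
    t = (m_a - m_b) / math.sqrt(pooled_variance * (1.0 / n_a + 1.0 / n_b))
    return t, dof


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length vectors."""
    m_x, m_y = mean(x), mean(y)
    s_xy = sum((a - m_x) * (b - m_y) for a, b in zip(x, y))
    s_xx = sum((a - m_x) ** 2 for a in x)
    s_yy = sum((b - m_y) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)
```

In use, `student_t` would be applied to the abundances of one spot in the treated and control groups, and `pearson_r` to the abundances of two spots across the same set of gels.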



2.6 Graphical data output 

Graphical results are prepared in GKS and translated 
within Kepler* into output for any of a variety of devices. 
Line-drawing output is typically prepared as PostScript and
printed on an Apple LaserWriter. Detailed maps presented
here have been generated using an ultra-high-resolution
PostScript-compatible Linotronic output device. Greyscale
graphics are reproduced from the workstation screen using 
a Seikosha videoprinter. Patterns are shown in the standard 
orientation, with high molecular mass at the top and acidic 
proteins to the left. 

2.7 Experiment LSBC04 

In the study described here 12-week-old Charles River male F344 rats were
used. Diets were prepared at LSB, based on a Purina 5755M Basal Purified
Diet. Lovastatin and cholestyramine were obtained as prescription
pharmaceuticals, ground and mixed with the diet at concentrations of 0.075%
and 1%, respectively. The high cholesterol diet was Purina 5801M-A (5%
cholesterol plus 1% sodium cholate in the control diet). Animal work was
carried out at Microbiological Associates (Bethesda, MD). Animals were
acclimatized for one week on the control diet, fed test or control diets
for one week, and sacrificed on day 8. Average daily doses of lovastatin
and cholestyramine in appropriate groups were 37 mg/kg/day and 5 g/kg/day,
respectively, based on the weight of the food consumed. Liver samples were
collected and prepared for 2-D electrophoresis according to the standard
liver protocol (homogenization in 8 volumes of 9 M urea, 2% NP-40, 0.5%
dithiothreitol, 2% LKB pH 9-11 carrier ampholytes, followed by
centrifugation for 30 min at 80 000 × g). Kidney, brain and [...] samples
were frozen. Gels were run as described above and the data was analyzed
using the Kepler* system. Gels were scaled, to remove the effect of
differences in protein loading, by setting the summed abundances of a large
number of matched spots equal for each gel (linear scaling).
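
The linear scaling step can be sketched in a few lines of Python. This is an illustration of the idea only, not the Kepler implementation: the function name and data layout are hypothetical, the abundances in the test are made up, and the choice of the mean gel sum as the common target is an assumption (any fixed target gives the same relative abundances).

```python
# Sketch of linear scaling: one multiplicative factor per gel, chosen so that
# the summed abundance of a reference set of matched spots is equal on all gels.


def scale_gels(gels, reference_spots, target_sum=None):
    """Return a scaled copy of gels.

    gels maps gel id -> {master spot number: abundance}.
    reference_spots lists the matched spots used to compute each gel's sum.
    """
    sums = {gel_id: sum(spots[s] for s in reference_spots)
            for gel_id, spots in gels.items()}
    if target_sum is None:
        # Assumed convention: scale every gel to the mean of the gel sums.
        target_sum = sum(sums.values()) / len(sums)
    return {gel_id: {s: a * target_sum / sums[gel_id] for s, a in spots.items()}
            for gel_id, spots in gels.items()}
```

Because each gel is multiplied by a single factor, ratios between spots on the same gel are unchanged; only differences attributable to total protein load are removed.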



3 Results and discussion 



3.1 The rat liver protein 2-D map 



F344MST3 is a standard 2-D pattern of rat liver proteins 
based on the Fischer 344 strain. This pattern was initiated 
from a single 2-D gel and extensively edited in an experi- 
ment comparing it to a range of protein loads, so as to in- 
clude both small spots and well-resolved representations of 
high-abundance spots. More than 700 rat liver 2-D patterns 
have been matched to F344MST3 in a series of drug effects 
and protein characterization experiments, and numerous 
new spots (induced by specific drugs, for instance) have 
been added as a result. A modified version including addi- 
tional spots present in the Sprague-Dawley outbred rat has 
also been developed (data not shown). Figure 1 shows a 
greyscale representation and Fig. 2 a schematic plot of the 
master pattern. More than 1200 spots are included, most of 
which are visible on typical gels loaded with 10 µL of solubilized
liver protein prepared by the standard method and
stained with colloidal Coomassie Blue. Master spot num- 
bers (MSN's) have been assigned to all proteins, and ap- 
pear in the following figures, each showing one quadrant of 
the pattern. Figure 3 shows the upper left (acidic, high 
molecular mass) quadrant. Fig. 4 the upper right (basic, 
high molecular mass) quadrant, Fig. 5 the lower left (acidic, 
low molecular mass) quadrant, and Fig. 6 the lower right 
(basic, low molecular mass) quadrant. The quadrants over- 
lap as an aid to moving between them. The gel position (in 
100 micron units), isoelectric point (relative to the CPK internal pI
standards) and SDS molecular mass (from the calibration curve in Fig. 8)
are listed for each spot (Table 1). Because of the precision of the CPK-pI
values, these parameters can be used to relate spot locations between gel
systems more reliably than using pI measurements expressed as pH. A major
objective of current studies is the identification of all major spots
corresponding to known liver proteins, as well as rigorous definitions of
subcellular organelle contents. Of particular interest to us is the
parallel development of identifications in the rat and mouse liver maps,
allowing detailed comparisons of gene expression effects in the two
systems. The results of these studies will be presented systematically in a
later edition of this database. We include here a useful series of 22
orienting identifications as an aid to other users of the rat liver pattern
(Table [...]).



3.2 Carbamylated charge standards, computed pI's and
molecular mass standardization

We have previously shown that the use of a system of closely-spaced
internal pI markers (made by carbamylating a basic protein) offers an
accurate and workable solution to the problem of assigning positions in the
pI dimension [32]. The same system, based on 36 protein species made by
carbamylating rabbit muscle CPK, has been used here to assign pI's to most
rat liver acidic and neutral proteins. The standards were coelectrophoresed
with total liver proteins, and the standard spots added to a special
version of the master pattern F344MST3. The gel X-coordinates of all liver
protein spots lying within the CPK charge train were then transformed into
CPK pI positions by interpolation between the positions of immediately
adjacent standards (Table 1) using a Kepler* vector procedure.
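
The transformation from gel X-coordinate to CPK pI position can be sketched as a piecewise interpolation between neighbouring standards. The Python example below assumes linear interpolation, which the text does not specify, and is not the Kepler vector procedure itself. The function name is hypothetical and the standard positions in the test are invented for illustration.

```python
# Sketch: convert a gel X-coordinate into a CPK pI position by linear
# interpolation between the two immediately adjacent charge standards.


def cpk_pi(x, standards):
    """Interpolated CPK pI for gel coordinate x.

    standards is a list of (gel_x, cpk_charge_position) pairs, one per member
    of the carbamylation train. Returns None if x lies outside the train.
    """
    ordered = sorted(standards)
    for (x0, p0), (x1, p1) in zip(ordered, ordered[1:]):
        if x0 <= x <= x1:
            return p0 + (p1 - p0) * (x - x0) / (x1 - x0)
    return None
```

A spot lying 60% of the way from the standard at position -12 to the standard at position -11 would be assigned a CPK pI of -11.4. Spots outside the charge train are left unassigned rather than extrapolated.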

It has proven possible to compute fairly accurate pI values for many
proteins from the amino acid composition [42]. We have attempted here to
test a further elaboration of this approach, in which we computed pI's for
the CPK standards themselves, based on our knowledge of the rabbit muscle
CPK sequence and the fact that adjacent members of the charge train
typically differ by blockage of one additional lysine residue (Table 3). We
compared these values to similar computed pI's for an additional set of
carbamylated standards made from human hemoglobin beta chains and a series
of rat liver and human plasma proteins of known position and sequence
(Fig. 7, Table 4). The result demonstrates good concordance between these
systems. Two proteins show significant deviations: liver fatty-acid binding
protein (FABP; #1 in Table 4) and protein disulphide isomerase (#20 in the
table). The FABP spot present on F344MST3 may represent a charge-modified
version of a more basic parent spot closer to the expected pI, not resolved
in the IEF/SDS gel. Of particular importance is the fact that, by comparing
computed pI's of sequenced but unlocated proteins with the CPK pI's, we can
assign a probable gel location without making any assumptions regarding the
actual gel pH gradient. This offers a useful shortcut, given the vagaries
of pH measurement on small diameter IEF gels. We have used this approach to
compute the CPK pI's of all rat and mouse proteins in the PIR sequence
database, as an aid to protein identification (data not shown).

In order to standardize SDS molecular weight (SDS-MW), we have used a
standard curve fitted to a series of identified proteins (Fig. 8). Rather
than using molecular mass per se, we have elected to use the number of
amino acids in the polypeptide chain, as perhaps a better indication of the
length of the SDS-coated rod that is sieved by the second dimension slab.
The resulting values were multiplied by [...] (the weighted average mass of
amino acids in sequenced proteins) to give predicted molecular masses.
Because we use gradient slabs, we have not constrained the fitted curve to
conform to any predetermined model; rather we tried many equations and
selected the best using the program "Tablecurve" on a PC. The equation
chosen was [...] c/x, where y is the number of residues, x is the gel Y
coordinate, a is 511.83, b is -0.2731 and c is 33 183801. The resulting fit
appears to be fairly good over a broad range of molecular mass.

3.3 An example of rat liver gene regulation: Cholesterol
metabolism

Experiment LSBC04 was designed as a small-scale test of 
the regulation of cholesterol metabolism in vivo by three 
agents included in the diet: lovastatin (Mevacor*, an inhibitor
of HMG-CoA reductase); cholestyramine (a bile acid
sequestrant that has the effect of removing cholesterol
from the gut-liver recirculation); and cholesterol itself. The
first two agents should lower available cholesterol and the
third should raise it, allowing manipulation of relevant 
gene expression control systems in both directions. Such 
an experiment offers an interesting test of the 2-D mapping 
system since most of the pathway enzymes are present in 
low abundance, many are membrane-bound and difficult 
to solubilize, and the pathway itself is complex. Approxima- 
tely 1000 proteins were separated and detected in liver ho- 
mogenates. Twenty-one proteins were found to be affected 
by at least one treatment, and these could be divided into 
several coregulated groups. 

3.3.1 MSN 413 (putative cytosolic HMG-CoA synthase)
and sets of spots regulated coordinately or inversely

One group of spots (including a spot assigned to the cyto- 
solic HMG-CoA synthase, MSN 413) showed the expected 
increase in abundance with lovastatin or cholestyramine, 
the synergistic further increase with lovastatin and choles- 
tyramine, and a dramatic decrease with the high cholesterol 
diet. Spot number 413 is the most strongly regulated pro- 
tein in the present experiment, showing a 5- to 10-fold in- 
duction after a 1 week treatment with 0.075% lovastatin and 
1% cholestyramine in the diet (Figs. 9 and 10). Its expres- 
sion follows precisely the expectation for an enzyme whose 
abundance is controlled by the cholesterol level; it is pro- 
gressively increased from the control levels by cholestyra- 
mine, lovastatin and lovastatin plus cholestyramine, and it 
sinks below the threshold of detection in animals fed the 
high cholesterol diet. This spot has been tentatively identified
as the cytosolic HMG-CoA synthase, based on a reaction
with an antiserum to that protein provided by Dr. Michael
Greenspan at Merck Sharp & Dohme Research Laboratories.
This enzyme lies immediately before HMG-CoA
reductase in the liver cholesterol biosynthesis pathway, and
is known to be co-regulated with it. Spot 413 has an SDS
molecular weight of about 54 000 and a CPK pI of -11.4, in
reasonably close agreement with a molecular weight of
57 300 and a CPK pI of -15.7 computed from the known
sequence of the hamster enzyme [43].

Using a classical product-moment correlation test (Kepler procedure
CORREL), a series of five additional spots was found to be coregulated with
413. The level of correlation was exceedingly high (> 95%). Two of these,
1250 and 933, are at similar molecular weights and approximately one charge
more acidic than 413 (Fig. 9), indicating that they may be covalently
modified forms of the 413 polypeptide. This suspicion is strengthened by
the observation that both spots are also stained by the antibody to
cytosolic HMG-CoA synthase. The remaining three correlated spots appear to
comprise an additional related pair (1253 and 1001) of around 40 kDa and a
single spot (1119) of around 28 kDa. Because these two presumed proteins
are present at substantially lower abundances than 413, and because the
cytosolic HMG-CoA synthase is reported to consist of only one type of
polypeptide, they are likely to represent other, very tightly coregulated
enzymes. A second group of six spots was selected based on a regulatory
pattern close to the inverse of that for spot 413 (MSN's 34, 79, 178, 182,
204, 347; data not shown). For these proteins, the lowest level of
expression occurs with exposure to lovastatin plus cholestyramine and the
highest level upon exposure to the high-cholesterol diet. Spots 182 and 79
are highly correlated and lie about one charge apart at the same molecular
weight; they may thus be isoforms of a single protein. The other four spots
probably represent additional enzymes or subunits.

3.3.2 MSN 235 and coregulated spots

A third group of five spots, mainly comprised of mitochondrial proteins
including putative mitochondrial HMG-CoA synthase spots, showed a modest
induction by lovastatin alone, but little or no effect with any of the
other treatments (including the combination of lovastatin and
cholestyramine; Fig. 12). This result is intriguing because lovastatin was
expected to affect only the regulation of enzymes of cholesterol synthesis,
which is entirely extra-mitochondrial. Three of the spots (235, 134, 144)
form a closely-packed triad at approximately 30 kDa, and are likely to
represent isoforms of one protein. All three spots are stained by an
antibody to the mitochondrial form of HMG-CoA synthase obtained from Dr.
Greenspan. Subcellular fractionation indicates a mitochondrial location.
The other two spots (633 at about 38 kDa and 724 at about 69 kDa) are each
present at lower abundance than the members of the triad.



proteins of the putative mitochondrial pathway are much more variable in
their expression in all groups. Examination of all the coregulated groups
suggests that quantitative statistical techniques can extract a wealth of
interesting information from large sets of reproducible gels. The abundance
of spots in the 413 coregulation group, for example, shows an amazing level
of concordance in their expression among the five individuals of the
lovastatin plus cholestyramine treatment group. This effect is not due to
differences in total protein loading, since they have already been removed
by scaling, and since proteins with quite different regulation patterns can
be demonstrated (e.g., Fig. 13). Such effects raise the possibility that
many gene coregulation sets may be revealed through the study of a
sufficiently large population of control animals (i.e., without any
experimental manipulation). This approach, exploiting natural biological
variation in protein expression instead of drug effects, offers an
important incentive for the construction of a large library of control
animal patterns.



4 Conclusions 

Because of the widespread use of rat liver in both basic biochemistry and
in toxicology, there is a long-term need for a comprehensive database of
liver proteins. The rat liver master pattern presented here has proven to
be an accurate representation of this system, having been matched to more
than 700 gels to date. As the number of proteins identified and the number
of compounds tested for gene expression effects grows, we expect this
database to contribute valuable insights into gene regulation. Its
practical utility in several areas of mechanistic toxicology is already
being demonstrated.

Received September 11, 1991 



3.3.3 An example of an anti-synergistic effect

A sixth spot (367) shows strong induction by lovastatin
(two- to threefold), and about half as much induction with
lovastatin plus cholestyramine, but without sharing the
animal-animal heterogeneity pattern of the 235-set (Fig. 13).
This protein is also mitochondrial, and represents the clearest
example of an anti-synergistic effect of lovastatin and
cholestyramine. The existence of such an effect demonstrates
that lovastatin and cholestyramine do not act exclusively
through the same regulatory pathway.

3.3.4 Complexity of the cholesterol synthesis pathway

Taken together, these results suggest that treatment with
lovastatin alone can affect both cytosolic and mitochondrial
pathways using HMG-CoA, while cholestyramine, on the
other hand, either alone or in combination with lovastatin,
produces a strong effect on the putative cytosolic pathway
but little or no effect on the putative mitochondrial pathway.
An explanation for this difference may lie in lovastatin's
effect on levels of HMG-CoA and related precursor
compounds that are exchanged between the cytosol and
the mitochondrion, whereas cholestyramine should affect
only the cytosolic pathways directly controlled by cholesterol
and bile acid levels. It remains to be explained why some
Figure 9. Montage showing effects in the region of MSN:413. The
montage shows a small window into one portion of the 2-D pattern,
one row of windows for each experimental group, and one panel for
each gel in the experiment. The left-most panel in each row is a
group-specific copy of the master pattern, followed by the patterns
for the five individual rats in the group. The highlighted protein
spots (filled circles) are spot 413 (on the right of each panel;
identified as cytosolic HMG-CoA synthase) and two modified forms
of it (including spot 933). From the top, the rows (experimental
groups) are: high cholesterol, controls, cholestyramine, lovastatin,
and lovastatin plus cholestyramine.
Figure 10. Bargraph showing the quantitative effects of various
treatments on the abundance of MSN:413 (cytosolic HMG-CoA
synthase) in the gels of Fig. 9.




Figure 11. Bargraphs of a series of six coregulated spots including
MSN:413. In the bargraphs, the abundances of the appropriate spot
(master spot number shown at the top of the panel) in each animal
are shown. The five five-animal groups are in the order (left to
right): high cholesterol, controls, cholestyramine, lovastatin, and
lovastatin plus cholestyramine. Each bar within a group represents
one experimental animal liver (one 2-D gel). Note the correlated
expression of the 6 spots, especially in the two far right (most
strongly induced) groups.
Figure 12. Data on a second coregulated group of spots, presented
as in Fig. 11. The fourth experimental group (lovastatin) shows a
modest induction, while the fifth group (lovastatin plus
cholestyramine) does not.




Figure 13. Data on spot MSN:367, presented as in Fig. 11. This protein
shows unambiguously the anti-synergistic effect of lovastatin and
cholestyramine (fifth group) as compared to lovastatin (fourth group).
This response contrasts strongly with the regulation pattern seen in
Fig. 11.
Table 1. Master table of proteins in the rat liver database



MSN



Y CPKpI SDSMW



3 3 

"11 

15 
17 
10 
19 

20 
21 
22 
23 
24 
25 
27 
26 
29 
30 
32 
33 
34 
35 
36 
38 
39 
41 
42 
43 
44 
46 
47 
48 

49 

.50 

51 

52 
"53 

54 

55 

56 

57 

56 

59 

60 

61 

62 

65 

66 

67 

68 

69 

71 

72 

73 

74 

75 

76 

77 

78 

79 

80 

81 

82 

83 

84 

85 

86 

87 

88 

89 
90 
91 
92 
83 
94 



311 
566 
812 
549 
845 
629 
906 
755 
649 
1204 
332 
787 
313 
607 
1184 
1263 
743 
768 
1216 
1145 
1037 
863 
712 
763 
304 
1165 
684 
1318 
1924 
1203 
1391 
309 
605 
621 
1113 
1820 
725 
2001 
722 
678 
1682 
1091 
1171 
1400 
1853 
1888 
735 
1263 
1252 
779 
1064 
656 
636 
1582 
1570 
1264 
1338 
1833 
1767 
925 
534 
1811 
1412 
1471 
1662 
1596 
1817 
516 
1569 
1706 
651 
1415 
1773 
1338 
1708 



434 

263 

426 

268 

520 

589 

414 

298 

403 

448 

434 

424 

417 

516 

524 

446 

605 

112 

417 

445 

555 

412 

606 

694 

470 

569 

607 

569 

362 

586 

447 

454 

587 

535 

522 

499 

177 

500 

830 

533 

302 

580 

585 

624 

506 

567 

297 

312 

407 

692 

296 

589 

545 

583 

556 

621 

564 

363 

565 

738 

696 

363 

681 

347 

563 

479 

301 
1371 

696 

719 
329 

710 
545 
446 



<-35.0 
-24.3 
-16.0 
-25.2 
-15.3 
-21.6 
-14.0 
-17.5 
-20.9 
-6.7 
<-35.0 
-16.6 
<-35.0 
-16.1 
-9.0 
-8.0 
-17.8 
-17.2 
•6.6 
-9.5 
-11.3 
•14.9 
-18.7 
-17.3 
<-35.0 
-9.2 
•19.6 
-7.3 
-0.1 
-6.7 
-6.3 
<-35.0 
-22.5 
•21.8 
-10.0 
-0.9 
-18.3 
>0.0 
-184 
•19.8 
•2.5 
•10.3 
•3.2 
-6.2 
-0.6 
-04 
-18.1 
-8.0 
-8.1 
-16.8 
-10.8 
•20.6 
•21.2 
-3.6 
•3.8 
-8.0 
-7.0 
-0.6 
•1.5 
-13.6 
-26.1 
•1.0 
-6.0 
-5.0 
-2.7 
-3.4 
-0.9 
-27.0 
-3.5 
-2.2 
•20.8 
•6.0 
-1.4 
-7.0 
-2.2 



63.600 
102.900 
64.800 
101.000 
55.200 
50.000 
66.300 
90.200 
67.900 
62.100 
63.600 
65.000 
66.000 
55.500 
54.900 
62.400 
49.000 
348.600 
66.000 
62.500 
52.400 
66.600 
48.900 
43.800 
59.800 
51.400 
48.800 
50.000 
74.600 
50.200 
62,300 
61.500 
50.100 
53.900 
55.000 
57,000 
170.600 
56.900 
37.300 
54,100 
89.000 
50.60C 
50.300 
47.800 
56.200 
51.500 
90.500 
85.900 
67.300 
43.900 
90.80C 
50.000 
53.100 
50,400 
52.300 
48.000 
51.800 
74.400 
51.700 
41.600 
43.600 
74,500 
44.500 
77.500 
51,600 
56.900 
89.100 
17.400 
43.600 
42.500 
81.700 
43.000 
53.200 
62.300 
43.700 



MSN


X Y CPKpI SDSMW


95 


1119 


536 


-9.9 


53.800 


174 


1364 


163 


96 


1731 


756 


•2.0 


40.700 


175 


825 


393 


97 


1033 


566 


-11.4 


51.600 


177 


1562 


553 


96 


1406 


565 


4.1 


51.700 


178 


1321 


710 


96 


578 


1149 


-23.6 


25.000 


179 


1089 


615 


100 


2004 


538 


>0.0 


53.700 


180 


1866 


567 


101 


1106 


623 


-10.1 


47.900 


181 


411 


295 


102 


482 


455 


•28.5 


61.300 


182 


804 


730 
696 


103 


665 


830 


-20.2 


37.300 


164 


1860 


104 


773 


1182 


-17.0 


23.800 


185 


1997 


1017 


105 


312 


1117 


<-35.0 


26.100 


186 


279 


1113 


106 


1769 


509 


-1.5 


56.100 


187 


773 


296 


107 


1565 


720 


-3.6 


4Z50C 


188 


1538 


807 


106 


1692 


807 


-2.4 


38.300 


191 


1560 


674 


109 


1482 


593 


•4.8 


49.700 


192 


1818 


687 


110 


778 


516 


-16.9 


55.500 


193 


1469 


555 


111 


1726 


700 


-2.0 


43.500 


194 


1380 


266 


113 


1191 


680 


-8.9 


44,500 


195 


784 


632 


114 


1296 


185 


-7.5 


160,800 


196 


1227 


1185 


115 


682 


907 


-19.6 


34,100 


197 


667 


553 


116 


1146 


610 


-9.5 


48,700 


196 


2006 


681 


117 


1546 


849 


•4.1 


36,500 


199 


1711 


674 


118 


1050 


577 


-11.1 


50,800 


2a 


872 


424 


120 


1530 


628 


-4.3 


37,4a 


201 


292 


435 


121 


836 


423 


-15.4 


65,200 


202 


736 


253 


122 


1572 


712 


•3.8 


42.900 


203 


786 


829 


123 


23 


1433 


<-35.0 


15.300 


204 


1224 


589 


124 


621 


1474 


-21.9 


13,900 


205 


439 


963 


125 


1296 


662 


-7.5 


36.00C 


206 


1994 


571 


126 


872 


921 


-14.7 


33.50C 


207 


1895 


687 


127 


1000 


717 


-12.0 


42.600 


208 


240 


1418 


128 


1229 


311 


44 


86,100 


210 


17a 


499 


129 


1422 


832 


-5.8 


37.3a 


211 


902 


517 


130 


1776 


499 


-14 


57.0a 


213 


1067 


684 


131 


1930 


757 


•0.1 


40.7a 


214 


1340 


668 


132 


660 


537 


•204 


53.8a 


215 


1591 


495 


133 


666 


1019 


-20.2 


29.7a 


216 


1585 


755 


134 


1271 


862 


-7.9 


36.oa 


217 


1159 


393 


135 


1161 


1389 


-9.3 


i6.sa 


218 


931 


572 


136 


453 


1063 


-29.7 


28,ia 


219 


713 


177 


137 


1858 


823 


-0.6 


37.7a 


220 


1479 


911 


136 


1504 


697 


-4.6 


43,7a 


221 


965 


927 


139 


1488 


707 


-4.8 


43,2a 


223 


934 


716 


140 


1689 


756 


-24 


40.7a 


225 


1812 


1045 


141 


311 


1417 


<-35.0 


15.8a 


226 


821 


411 


142 


1366 


915 


-6.7 


33.8a 


227 


1586 


1483 


143 


1429 


346 


-5.7 


77,9a 


228 


1065 


567 


144 


615 


1017 


-22.1 


29,8a 


229 


1577 


690 


145 


2006 


566 


>0.0 


51.6a 


230 


1458 


496 


146 


2006 


518 


>0.0 


55.3a 


232 


1440 


849 


147 


1070 


1108 


-10.7 


26.5a 


234 


1692 


489 


148 


1347 


578 


-6.9 


so.ea 


235 


618 


10O4 


149 


541 


1481 


-25.7 


13.7a 


236 


920 


1136 


150 


1645 


760 


-2.8 


40,5a 


237 


952 


1006 


151 


1269 


236 


-7.9 


u7.oa 


238 


1611 


541 


152 


1507 


911 


-4.5 


33,9a 


239 


1489 


720 


153 


1722 


448 


-2.1 


62.ia 


240 


501 


448 


154 


932 


503 


-13.5 


56.6a 


241 


1820 


569 


155 


1031 


294 


-11.4 


91.4a 


242 


1357 


656 


156 


1970 


684 


>0.0 


44,400 


243 


711 


1182 


157 


1258 


183 


-8.1 


162.4a 


244 


1855 


621 


158 


1275 


417 


-7.8 


65,9a 


245 


1189 


474 


159 


1663 


820 


-2.6 


37.8a 


246 


551 


459 


160 


1034 


527 


-11.4 


54.6a 


247 


1346 


604 


161 


1953 


771 


>0.0 


40,0a 


248 


460 


448 


162 


1020 


1482 


-11.6 


13.7a 


249 


1733 


451 


164 


1566 


606 


-3.8 


38.4a 


CSV 






166 


1905 


565 


-0.2 


517a 


251 


806 


392 


167 


1340 


181 


-7.0 


164.9a 


252 


874 


553 


168 


1506 


583 


-4.6 


50.400 


253 


753 


848 


169 


1338 


678 


-7.0 


447a 


254 


995 


450 


170 


1969 


541 


>0.0 


53.5a 


255 


1690 


679 


171 


800 


378 


-16.3 


71.8a 


256 


994 


1006 


172 


476 


956 


-28.7 


32.1a 


257 


508 


464 


173 


919 


1314 


-13.7 


19.3a 


256 


1517 


820 



CPKpI SDSMW



-6.7 
-15 7 
-3.6 
-7.2 
-104 
-0.5 
•32.1 
•16.2 
-0.6 
>0.0 
<-35.0 
-17.0 
-4.2 
-3.9 
-0.9 
-5.0 
-64 
-16.7 
-84 
-20.1 
>0.0 
-2.2 
-14.7 
<-35.0 
-18.0 
-16.7 
-8.5 
•30.9 
>0.0 
-0.3 
<-35.0 
2.3 
-14.1 
-10.4 
-7.0 
-3.5 
•3.6 
-9.3 
-13.5 
•18.7 
-4.9 
-12.8 
-13.5 
-1.0 
-15.8 
-3.6 
-10.8 
-3.7 
-5.2 
•5.5 
-2.4 
-22.0 
-13.7 
-13.1 
-3.2 
-4.8 
-27.7 
-0.9 
-6.8 
-18.7 
-0.6 
-6.9 
-25.1 
-6.9 
-29.3 
-1.9 
>0.0 
-16.1 
-14.6 
-17.6 
-12.1 
-2.4 
-12.1 
-27.4 
•44 ' 



162.9a 
69,3a 

szea 
43.0a 

48.3a 

51.6a 
91.2a 
42.0a 
* 34,5a 

29.8a 
26.3a 
90.8a 
38.4a 

44.9a 
44,2a 

52.4a 

101.6a 
47.3a 

23.7a 
52.6a 

44.5a 
44.9a 

65.0a 

63,7a 
107.8a 

37.4a 
50.0a 
31.1a 
51.3a 
44.2a 
15.8a 
57.0a 
55.4a 
44.4a 
45.2a 
57.3a 
40.7a 

69.3a 

51.2a 
170.5a 
33.9a 
33.3a 
42.7a 

28.8a 
66.8a 

13.6a 
51 .6a 

34.800 
57.3a 
36.5a 

57.9a 
30.3a 

25.4a 

30.2a 

53.5a 

42.5a 
62.1a 
51,4a 
45,8a 
23.ea 

48.0a 

59.3a 

61.000 

49.1a 
62,1a 

61.800 

39.2a 

69.5a 

52.5a 

36.5a 

61.9a 

44.6a 

30.2a 
60,4a 
37.aa 



isoelectric point relative to CPK standards, and
predicted molecular mass (from the standard curve of Fig. 8).






MSN 



N. L. Anderson et al.



Y CPKpI SDSMW



MSN 



Y CPKpI SDSMW



250 
260 
261 
262 
263 
265 
266 
267 
266 
260 
270 
271 
272 
274 
275 
276 
277 
278 
270 
281 
282 
283 
284 
285 
286 



290 
291 
292 
293 
294 
295 
296 
297 
299 
300 
301 
302 
303 
304 
305 
306 
307 
308 
309 
310 
311 
312 
313 
314 
315 
316 
320 
321 
322 
323 
324 
325 
326 
327 
328 
330 
331 
332 
333 
334 
335 
336 
336 
339 
340 
341 
343 
344 



1706 
661 
1725 
406 
1063 
1390 
510 
660 
430 
1044 
2010 
657 
695 
1202 
1350 
1670 
686 
061 
670 
1646 
1505 
1313 
1314 
1332 
1277 
1301 
1147 
02S 
787 
1462 
531 
860 
1162 
218 
1377 
913 
2012 
702 
494 
403 
1643 
1049 
1606 
1219 
1627 
1524 
1760 
1609 
266 
1902 
1316 
1341 
1104 
1480 
850 
1454 
670 
655 
1521 
1587 
1388 
448 
1608 
1566 
531 
784 
1050 
1503 
1616 
1654 
1265 
581 
1407 
1351 
1813 



061 
1361 
670 
1127 
172 
673 
437 
1038 
061 
606 
853 
422 
068 
712 
500 
1089 
538 
718 
570 
1084 
525 
1147 
829 
408 
652 
824 
579 
511 
1476 
818 
449 



609 
814 
979 
1523 
667 
178 
1280 
1008 
1585 
593 
989 
916 
755 
692 
1028 
1451 
1406 
1365 
1395 
523 
1053 
1459 
603 
1404 
626 
101 
675 
677 
400 
1201 
751 
607 
471 
1156 
407 
303 
506 
1004 
888 
565 
1047 
265 
540 



-1.1 
-20.4 
•2.0 
2B.0 
10.0 
-6.3 
•27.3 
-204 
■31.0 
-11.2 
>0.0 
-15.0 
-14.2 
-7.6 
-6.0 
-£6 
-10.4 
-13.0 
-14.5 
-0.7 
^4.6 
•7.3 
-7.3 
•7.1 
-7.8 
-6.3 
-0.5 
-13.6 
-16.6 
-5.1 
-26.3 
-14.9 
-9.3 
<-35.0 
-6.5 
-13.9 
>0.0 
-19.0 
•28.1 
-32.6 
-0.7 
-11.1 
-3.3 
-8.5 
-3.0 
•4.4 
-1.5 
-3.3 
<-35.0 
-0.3 
-7.3 
-7.0 
-10.1 
-4.9 
•15.1 
-5.3 
-20.0 
•20.6 
-4.4 
-3.6 
-6.3 
-30.0 

-3.8 
-26.3 
-16.7 
-10.9 

-3.5 

•3.2 

-0.6 

-8.0 
-23.6 

-4.7 

-6.8 

-0.9 



31.000 
17.700 
44.600 
25.800 
177,400 
45.000 
63.400 
20.000 
31.000 
48,000 
36.300 
65.200 
31,700 
42.000 
49.000 
27,100 
53,700 
42.600 
51,300 
27.300 
54,800 
25.100 
37,400 
67.200 
46.100 
37,600 
50,700 
55,000 
13.000 
37.800 
62,000 
43.600 
48,700 
38,000 
31.300 
1Z400 
45.300 
168.200 
20,400 
30.100 
10,300 
49.800 
30.900 
33,700 
40.700 
34,700 
29,400 
14,700 
16.100 
17.600 
16,600 
54.900 
28.500 
14,400 
49.100 
13.300 
47.700 
420,500 
44.800 
44,700 
67.000 
20.100 
40.000 
43.700 
50.600 
24,700 
67.300 
88.500 
49.400 
30,300 
34,000 
50.300 
28.700 
102.200 
52.600 



345 
346 
347 
348 
340 
350 
351 
352 
353 
354 
355 
356 
357 
356 
350 
360 
361 
362 
363 
364 
365 
366 
367 
368 
369 
370 
371 
372 
373 
374 
375 
376 

377 

378 

379 

381 

382 

363 

384 

385 

386 

387 

386 

389 

390 

391 

392 

393 

394 

395 

396 

397 

399 

400 

401 

403 



405 

406 

409 

410 

41 i 

412 

413 

415 

416 

417 

418 

419 

420 

421 

422 

423 

424 

425 



1006 
1095 
625 
361 
110 
521 
012 
1574 
061 
706 
1450 
1374 
474 
796 
764 
1384 
1713 
1161 
. 914 
412 
741 
878 
1560 
963 
434 
639 
1587 
1875 
1351 
1506 
1823 
254 
1409 
621 
1017 
953 
856 
1252 
1699 
1042 
1490 
1554 
1193 
1374 
1456 
718 
1799 
1482 
1227 
1530 
1410 
912 
1465 
1473 
1029 
1516 
1495 
1525 
723 
650 
1501 
936 
350 
1033 
737 
1578 
646 
1695 
725 
1289 
1171 



929 
739 
1490 



578 
640 
728 
083 
1343 
1130 
619 
530 
912 
762 
830 
1152 
997 
346 
338 
1066 
769 
859 
1156 
435 
486 
1503 
935 
520 
441 
610 
860 
762 
1059 
715 
532 
417 
563 
494 
595 
598 
674 
258 
1518 
493 
563 
603 
404 
902 
969 
690 
732 
758 
1461 
577 
755 
256 
1063 
450 
1140 
754 
554 
1092 
252 
663 
476 
1057 
1120 
536 
425 
606 
496 
482 
770 
1041 
912 
162 
856 
625 
965 



-11.9 
-10.3 
-21.7 
-35.3 
<-35.0 
-26.7 
-13.9 
-3.7 
•12.9 
-18.9 
-5.3 
•6.5 
-28.7 
-16.3 
•17.3 
-6.4 
-i1 
-9.3 
•13.8 
-32.0 
-17.9 
-14.6 
-3.9 
-12.4 
-31.0 
-21.2 
-3.6 
-0.5 
-6.8 
-4.6 
-0.9 
<-35.0 
-6.1 
-21.8 
-11.7 
-13.1 
•15.0 
-8.1 
•Z3 
-11.2 
-4.7 
-4.0 
-8.9 
•6.5 
-5.2 
-18.5 
-1.1 
-4.8 
-8.4 
-4.3 
-6.0 
-13.9 
-5.0 
-4.9 
-11.5 
-4.4 
-4.7 
-4.3 
-18.4 
-20.8 
-4.6 
-13.4 
-35.9 
-11.4 
-18.0 
-3.7 
-21 .0 
-2.3 
-18.3 
-7.7 
-9.1 
-22.8 
-13.6 
-17.9 
-4.7 



50.800 
46,800 
42.000 
31.100 
18.300 
25.700 
48.100 
54.300 
33.900 
40,400 
37.300 
24.900 
30,600 
77,800 
79,400 
27.900 
40.100 
36,100 
24.800 
63.700 
58.200 
13.000 
33.000 
55.200 
63.000 
48.700 
36.100 
40,400 
28.300 
42.700 
54,200 
65.900 
50,400 
57,500 
49,600 
49,400 
44,000 
105,300 
12.500 
57.500 
50.400 
49.100 
67.700 
34,300 
31.700 
44,000 
41,900 
40,600 
14,400 
50.800 
40,800 
106,400 
28,100 
61,900 
25,300 
40.800 
52.500 
27.100 
106,000 
45.500 
59,000 
28.300 
26,000 
53,700 
64.900 
48.900 
57,300 
58,600 
40.000 
28.900 
33.900 
193,700 
36.200 
47.700 
31.800 




426 
427 
428 
429 
430 
431 
432 
434 
435 
436 
437 
438 
439 
440 
441 
443 
446 
447 
448 
449 
450 
451 
452 
453 
454 
456 
457 
450 
460 
461 
462 
463 
464 
465 
466 
468 
469 
470 
471 
472 
473 
474 
475 
476 
477 
478 
479 
480 
462 
483 
485 
486 
487 
488 
489 
490 
491 
492 
493 
494 
495 
496 
497 
499 
500 
501 
502 
503 
504 
505 
506 
507 
506 
509 
510 



1296 
810 
1565 
1250 
1253 
734 
483 
518 
1020 
1122 
1870 
435 
86 
1740 
599 
743 
801 
1050 
1245 
1576 
1818 
1094 
1945 
1652 
1403 
1394 
905 
1038 
1598 
1528 
1098 
849 
1614 
1388 
1194 
577 
1140 
1797 
1293 
618 
2009 
1205 
1035 
160 
469 
599 
1009 
1216 
816 
683 
1608 
478 
1025 
1045 
1609 
775 
692 
1100 
1760 
862 
470 
494 
960 
1414 
1234 
1246 
624 
1246 
1115 
1189 
1578 
787 
970 
1153 
1730 



704 
843 

303 
847 

562 
1426 
433 
1041 
1170 
196 
673 
1102 
847 
544 
1571 
335 
666 
926 
1296 
1516 
1021 
440 
802 
894 
500 
718 
436 
581 
294 
863 
1137 
1125 
1072 
481 
1084 
467 
888 
524 
1133 
655 
299 
215 
788 
155 
1370 
662 
540 
235 
346 
673 
1013 
599 
607 
1186 
301 
1289 
178 
964 
776 
247 
1256 
1436 
652 
546 
1072 
659 
792 
1134 
1407 
391 
402 
250 
552 
619 
1006 



-7.t> 
16 0 
•3.9 
-6.0 
-8.1 
•18.1 
•28.5 
-26.9 
•11.6 
•9.8 
-0.5 
•31.0 
<-35.0 
-1.8 
-22.8 
-17.8 
•16.2 
-11.1 
-8.2 
-3.7 
-0.9 
-10.3 
>0.0 
-2.8 
-6.1 
-6.3 
-14.0 
-11.3 
-34 
-4.3 
-10.2 
-15.2 
-0.9 
-6.3 
•8.9 
•23.9 
-9.6 
-1.1 
•7.6 
-21.9 
>0.0 
-8.7 
-11.4 
<-35.0 
•28.9 
•22.8 
-11.8 
-8.6 
-15.9 
•19.3 
-3.3 
-28.6 
•11.5 
-11.2 
-3.3 
-17.0 
-19.3 
-10.2 
-1.6 
-14.5 
•28.9 
-28.1 
-12.5 
-6.0 
-8.3 
-8.2 
-15.7 
-6.2 
•9.9 
-8.9 
-3.7 
-16.6 
-12.5 
-94 
-2.0 



366QQ 
36QCQ 

'5.SCC 
63.9CC 
28,90c 
*<3CC 
K7.«Ct 

<5.aa: 

26.7CC 
36.0CC 

10.BCC 
80, ICC 
45.2CC 
33.30c 
19.8QC 

lieoc 

29.60C 
63.10C 
38.60C 
34.600 
56 90C 
4Z600 
63.SK 
50.500 
91.400 
35.90C 
25 4X 
25 ax 
27.80C 
58.700 
27,300 
60.100 
34.900 
54.800 
25.500 
46.000 
89.900 

131.300 
39.20C 

207.600 
17.400 
45.600 
53.500 

117.400 
77.800 
44.900 

30.000 

49.300 

48.800 

23.7V 

89.200 

20.100 

1 69.300 

31.800 

39.700 
110.700 

21-200 

15*00 

3*400 

SM* 

27.800 

45700 

39.000 

25,500 

1&20O 

6BL» 






109.** 
52.6* 
48.100 
3O10I 
Database of rat liver proteins






Y CPKpI SDSMW



511 

512 

513 

514 

515 

516 

517 

518 

519 

520 

521 

522 

523 

524 

525 

526 

527 

526 

530 

532 

533 

534 

535 

536 

536 

539 

540 

541 

542 

S43 

544 

545 

546 

547 

546 

549 

550 

552 

553 

555 

556 

557 

556 

558 

560 

562 

564 

565 

566 

567 

569 

570 

571 

573 

574 

575 

576 

577 

578 

579 

560 

581 

562 

564 

585 

566 

567 

588 

569 

590 

591 

592 

593 

594 

595 



609 
1099 
1696 
948 
481 
1334 
866 
796 
822 
632 
1332 
603 
1190 
479 
768 
747 
1170 
1502 
1728 
507 
870 
1347 
1513 
308 
1851 
1463 
909 
625 
1164 
803 
1259 
856 
803 
1162 
128 
1355 
595 
1369 
992 
1125 
705 
1477 
980 
700 
1028 
896 
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1003 
1006 
1007 
1009 
1010 
1011 
1012 
1013 
1014 
1015 
1016 
1017 
1018 
1020 
1021 
1022 
1023 
1024 
1025 



1815 
1205 
617 
968 
970 
1736 
643 
822 
875 
291 
1386 
459 
679 
1818 
1032 
1629 
1311 
1722 
1015 
1574 
781 
1129 
812 
785 
1290 



827 
865 
472 
498 
491 
2G9 
423 
736 
344 
665 
193 
152 
701 
547 
712 
816 
174 
419 
409 
320 
334 
1155 
255 
796 
154 
1048 
206 
232 
437 
567 
495 
961 
295 
664 
642 
1141 
642 
911 
1508 
317 
1105 
1159 
555 
361 
317 
928 
701 
811 
461 
647 
579 
504 
289 
290 
771 
478 
1184 
487 
279 
644 
745 
541 
661 
1128 
634 
994 
1134 
424 
743 
1219 



63 
317 
446 

739 



-6.8 
-1.5 
•22.7 
<-35.0 
-12.1 
•7.5 
*2l .6 
<*35.0 
-6.5 
-1.5 
•11.3 
-14.9 
•13.0 
•27.6 
>0.0 
•11.8 
-17.2 
•23.0 
-248 
•14.4 
-24.5 
-12.8 
•20.0 
•6.7 
-13.9 
-22.3 
-7.7 
-15.8 
-12.6 
-32.6 
<-35.0 
•15.3 
-9.8 
•12.1 
-3.2 
-17.7 
•10.8 
-8.8 
-1.6 
-6.9 
•11.5 
-17.9 
-15.9 
-16.7 
-9.3 
-10.4 
•11.5 
-15.2 
•14.1 
•14.4 
-0.9 
^8.7 
-220 
-128 
-12.7 
•1.9 
-21.1 
•15.8 
-146 
<-35.0 
-6.4 
-294 
-19.7 
-0.9 
-114 
-3.0 
-7.4 
-2.0 
•11.7 
-3.7 
-16 6 
-9.7 
-15.9 
-16.7 
-7.7 



37.500 
35.000 
59.60c 
57.1CC 
57.7* 

10 0.30c 
65.10C 
41.60c 

45.400 
U1.0CC 
213.000 
43.400 
53.000 
42.900 
37.900 
174.9CC 
65.7QC 
67.10C 
63.9CC 
80.500 
24.800 
106.6CC 
38.700 
210.300 
28,700 
138.9a 
119.30C 
63 400 
51.60C 
57.40C 
31.20C 
91,100 
45.400 
46.700 
25.300 
46.700 
33.900 
12.800 
84,700 
26.600 
24 600 
52.4CC 
74.900 
84.500 
33.300 
43.400 
38.200 
60.700 
36.600 
50.700 
56.500 
93.100 
92.700 
40.000 
58.900 
23.700 
58.100 
96.400 
46.600 
41.200 
53.500 
45.600 
25.800 
47.200 
30.700 
25.500 
65.000 
41.300 
22. 5°° 
58.400 

591.30° 
84.60° 
62.400 

41.500 
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MSN 



t0» 405 
1290 

USB 856 

1030 1284 

1031 986 

1032 1547 

1033 1381 

1034 1525 



Y CPKpI SDSMW



552 



1035 1128 

1036 1226 

1039 1 761 

1040 541 

1041 818 

1044 1036 

1045 1439 

1047 1540 

1048 1576 

1049 1089 

1050 949 
1061 426 
1052 1583 

1063 779 

1064 1613 

1065 1380 



10S6 


284 


1068 


1261 


1060 


393 


1061 


1817 


1062 


1245 


1064 


1258 


1065 


705 


1066 


1181 


1067 


529 


106B 


508 


1069 


1898 


1071 


873 


1073 


1768 


1075 


836 


1076 


1863 


t078 


826 


1081 


971 


1063 


1697 


1065 


1157 


1090 


620 


1092 


1867 


1093 


2019 


1094 


1546 


1095 


1545 


1096 


61 


1099 


1954 


hoi 


588 


M02 


1050 


(103 


457 


105 


1884 


106 


1714 


107 


1717 


1106 


1976 


H11 


547 


^112 


1348 


:115 


1385 


116 


1078 


'117 


975 


118 


1202 



'119 1022 

! 120 1905 

'121 1512 

:1 22 ni4 

123 1464 

!125 1048 

126 1122 

J 28 1722 

133 1098 

139 1830 

1«7 764 

i<8 1968 



547 
226 
822 
403 
551 
496 
645 
274 
262 
839 
910 
465 
407 
250 
635 
411 
1040 
818 
1385 
1092 
620 
377 
663 
746 
605 
645 
746 
792 
934 
734 
658 
696 
604 
609 
1128 
773 
661 
566 
463 
202 
794 
910 
597 
894 
538 
477 
935 
237 
1048 
667 
797 
532 
649 
546 
722 
1066 
621 
762 
616 
787 
933 
1076 
616 
1301 
677 
452 
857 
802 
892 
825 
569 
1182 
724 



-323 
-7.5 
-15.0 
-7.7 
-12.3 
-4.1 
-64 
-4.3 
-6.7 
4.5 
-1.6 
-25.7 
-15.8 
-11.3 
-5.5 
-4.2 
-3.7 
-10.4 
-13.2 
•31.1 
-3.6 
-16.6 
-3.2 
•6.5 
<-35.0 
-8.0 
-33.3 
-0.9 
-8.2 
-8.1 
•18.9 
-9.0 
•26.3 
-27.4 
-0.3 
-14.7 
-1.5 
-15.4 
-0.6 
-15.7 
-12.7 
-2.3 
-9 4 
-21.9 
-0.5 
>0.0 
-4.1 
-4.1 
<-35.0 
>0.0 
-23.3 
-11.1 
-29.5 
-0.4 
-2.1 
-2.1 
>0.0 
-25.3 
-6.9 
-6.4 
-10.6 
-12.6 
-8.7 
-11.6 
-0.3 
-4.5 
-9.9 . 
-5.1 
-11.1 
•9.8 
-2.1 
•10.2 
-0.8 
-17.3 
>0.0 



52.600 
36.500 
53.000 
123.200 
37.700 
67,900 
52.700 
57.200 
46.500 
98.300 
103.600 
36.900 
34.000 
56.300 
67,300 
109.200 
47.100 
66,700 
26.900 
37.800 
16.900 
27.000 
48,000 
72.000 
45.500 
41.200 
49.000 
46,600 
41.200 
39.000 
33,000 
41.800 
45.800 
43,700 
49,100 
48,700 
25.800 
39,900 
36.000 
51.600 
58,500 
142.300 
38.900 
34,000 
49.500 
34.600 
53.700 
59.100 
33,000 
116.000 
28.600 
45.200 
38.800 
54,200 
46.300 
53.100 
42.400 
28.000 
46.000 
40,400 
38.000 
39.300 
33.100 
27.600 
48.300 
19.700 
44,700 
61.700 
36.200 
38.600 
34.700 
37.500 
51.400 
23.800 
42.300 



1153 
1154 
1161 
1162 
1163 
1168 
1170 
1171 
1172 
1174 
1176 
1177 
1178 
1179 
1180 
1181 
1182 
1183 
11S4 
1185 
1186 
1189 
1190 
1191 
1192 
1193 
1194 
1195 
1196 
1197 
1198 
1199 
1200 
1201 
1202 
1203 
1204 
1205 
1206 
1209 
1210 
1211 
1212 
1214 
1215 
1216 
1217 
1218 
1219 
1220 
1221 
1222 
1223 
1224 
1Z25 
1226 
1227 
1228 
1229 
1230 
1231 
1232 
1233 
1234 
1235 
1236 
1237 
1238 
1239 
1240 
1241 
1242 
1243 
1244 
1245 



921 
1564 

637 
623 
665 
564 
552 
538 
545 
1099 
1304 
1366 
1606 
1485 
1459 
1431 
1407 
1383 
1454 
1422 
1394 
1171 
1457 
686 
265 
403 
344 
505 
572 
639 
637 
614 
637 
1095 
1719 
791 
964 
313 
306 
320 
326 
394 
402 
366 
641 
660 
914 
873 
970 
1021 
1392 
1354 
1362 
673 
614 
603 
696 
707 
475 
466 
759 
1324 
1583 
1865 
1812 
1411 
1392 
794 
769 
740- 
743 
713 
682 
663 
565 



1158 
864 

400 

397 
397 
528 
529 
524 
514 
522 
586 
539 
702 
224 
224 
223 
223 
224 
182 
183 
162 
214 
286 
1114 
893 
1292 
1275 
1311 
1293 
1502 
1402 
1407 
1431 
1394 
1545 
666 
1021 
195 
194 
197 
197 
294 
294 
294 
329 
329 
266 
245 
372 
296 
205 
203 
205 
540 
542 
539 
623 
628 
447 
1282 
1461 
1170 
1005 
809 
817 
703 
682 
410 
407 
406 
511 
510 
509 
504 
582 



-13.7 
-3.5 
-21.3 
-21 .8 
-20.2 
24 4 
-25.0 
-25.9 
•25.5 
-10.2 
-7.5 
-6.6 
-3.3 
-4.8 
.*5-2 
-5.7 
-6.1 
-64 
-5.3 
•5.8 
-6.3 
-9.2 
-5.2 
-19.5 
<-35.0 
-32.6 
<-35.0 
•27.6 
-24.1 
-21.2 
-21.3 
^22.1 
-21.3 
-10.3 
-2.1 
-16.5 
-12.9 
<-35.0 
<-35.0 
<-35.0 
<-35.0 
-33.2 
-32.7 
-33.7 
-21.2 
-20.4 
-13.8 
-14.7 
-12.7 
-11.6 
■6.3 
■6.8 
-6.7 
-19.9 
-22.1 
•22.6 
•19.2 
-18.9 
-28.7 
-29.0 
-17.4 
-7.2 
•3.6 
-0.6 
-1.0 
-6.0 
-6.3 
-16.4 
-17.1 
-17.9 
-17.8 
-18.7 
•19.6 
-20.3 
•24.4 



24.700 
35.900 
68.400 
68.800 
68.700 
54.500 
54.500 
54,800 
55.700 
55.000 
50.200 
53.700 
43,400 
124.900 
124.900 
125.100 
125.200 
124.700 
164.400 
162.600 
164,300 
131.800 
94.200 
26.200 
34,700 
20.000 
20.600 
19.400 
20.000 
13.000 
16.300 
16.200 
15.400 
16.600 
11.600 
45.200 
29.700 

148.700 

149,800 

147,400 

146,600 
91.400 
91.200 
91.400 
81.600 
61.600 

101,800 

112,000 
72.900 
90,100 

139.500 

141.800 

139.500 
53,600 
53,400 
53.600 

47,800 

47.500 

62.300 

20,400 

14,400 - 

24.200 

30,300 

38.200 

37,900 

43,400 

44,500 

66.900 

67.300 

67.500 

55.900 

56,000 

56,100 

56.500 

50.500 



1246 

1247 

1249 

1250 

1251 

1252 

1253 

1254 

1255 

1257 

1258 

1259 

1260 

1261 

1262 

1263 

1264 

1265 

1266 

1267 

1268 

1269 
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1271 

1272 

1273 

1274 

1277 

1278 
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1281 

1282 

1283 
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1285 

1286 
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1289 
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1292 

1293 

1294 

1295 



547 
530 
516 
973 
607 
665 
899 
1311 
1300 
1938 
1806 
1727 
1629 
1555 
1468 
1413 
1340 
1263 
1182 
1110 
1055 
999 
959 
905 
857 
810 
774 
737 
702 
671 
645 
617 
595 
573 
552 
536 
515 
496 
467 
447 
427 
412 
397 
381 
365 
348 



577 

576 

572 

536 

532 

529 

766 

746 

761 

712 

718 

715 

713 

717 

717 

722 

717 

717 

720 

717 

717 

717 
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712 
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706 
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707 

704 

700 

695 

694 

687 

683 

669 

667 

655 

655 

652 

654 

653 

653 < 



-25.3 
-26.3 
-27.0 
-12.7 
•224 
-20.2 
-14.1 
-7.4 
-7.5 
0.0 
-1.0 
-2.0 
-3.0 
-4.0 
-5.0 
-6.0 
'-7.0 
-6.0 
4.0 
-10.0 
-11.0 
•12.0 
-13.0 
-14.0 
-15.0 
-16.0 
-17.0 
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50.800 

50.900 
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54.400 

40.200 

41.200 

40.400 

42.900 

42.600 

42.700 
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42.600 
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42.600 
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44,400 

45.200 
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46.100 
Protein Name



PIR   #ASP  #GLU  #HIS  #LYS  #ARG  NH2-  Calc  Real
Name  3.9   4.1   6.0   10.8  12.5  7.0   pI    CPK



Rabbit muscle CPK KIRBCM



Hb-beta, human 



HBHU 



28 
28 
28 
28 
28 
28 
28 
28 
28 
28 
28 
28 
28 
28 
28 
28 
28 
28 
28 
28 
28 
28 
28 
28 
28 
28 
26 
28 
28 
28 
28 
28 
28 
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28 
26 
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7 
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27 
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8 
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8 

8 
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10 
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7 

6 

5 
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3 

2 

1 

0 
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18 
18 
18 
18 
18 
18 
18 
18 
18 
18 
18 
18 
18 
18 
18 
18 
18 
16 
18 
18 
18 
18 
18 
18 
18 
16 
18 
18 
18 
18 
18 
18 
18 
18 
18 
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9 
9 
9 
9 
9 
9 

9 

9 

9 

9 

9 

9 

9 



11 
10 
9 
8 
7 
6 
5 
4 
3 
2 
1 
0 
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3 
3 
3 
3 
3 
3 

3 

3 

3 

3 

3 

3 



6.84 

6.67 

6.54 

6.42 

6.31 

6.21 

6.12 

6.03 

5.94 

5.85 

5.76 

5.67 

5.58 

5.48 

5.39 

5.29 

5.20 

5.12 

5.04 

4.96 

4.89 

4.83 

4.77 

4.71 

4.66 

4.61 

4.56 

4.52 

4.48 

4.44 

4.40 

4.36 

4.32 
4.29 
4.25 
4.22 



7.18 
6.79 
6.53 
6.32 
6.13 
5.96 
5.78 
5.59 
5.37 
5.14 
4.91 



-1.8 
-3.2 
-5.3 
-7.2 
-10.0 
-12.3 
-15.5 
-18.0 
-21.0 



4.71 -25.5 
4.54 -27.2 
Table 4. Computed pI's of some known proteins related to measured CPK pI's

No. | Protein name | PIR name | #ASP | #GLU | #HIS | #LYS | #ARG | Computed pI | Measured CPK pI
0 | Creatine phosphokinase (CPK), rabbit muscle | KIRBCM | 28 | 27 | 17 | 34 | 18 | 6.84 | 0.0
1 | Fatty acid-binding protein, rat hepatic | FZRTL | 5 | 13 | 2 | 16 | 2 | 7.83 | -3.0
2 | b2-microglobulin, human | MGHUB2 | 7 | 6 | 4 | 8 | 5 | 6.09 | -5.0
3 | Carbamoyl-phosphate synthase, rat | SYRTCA | 72 | 96 | 28 | 95 | 56 | 5.97 | -5[?]
4 | Proalbumin (serum albumin precursor), rat | ABRTS | 32 | 57 | 15 | 53 | 27 | 5.98 | -6.2
5 | Serum albumin, rat | ABRTS | 32 | 57 | 15 | 53 | 24 | 5.71 | -9.0
6 | Superoxide dismutase (Cu-Zn, SOD), rat | A26810 | 6 | 11 | 10 | 9 | 4 | 5.91 | -9.2
7 | Phospholipase C, phosphoinositide-specific (?), rat | A28B07 | 34 | 42 | 9 | 49 | 21 | 5.92 | [illegible]
8 | Albumin, human | ABHUS | 36 | 61 | 16 | 60 | 24 | 5.70 | -[?].9
9 | Apo A-I lipoprotein, rat | A24700 | 18 | 24 | 6 | 23 | 12 | 5.32 | -13.7
10 | proApo A-I lipoprotein, human | LPHUA1 | 16 | 30 | 6 | 21 | 17 | 5.35 | [illegible]
11 | NADPH cytochrome P-450 reductase, rat | RDRT04 | 41 | 60 | 21 | 38 | 36 | 5.07 | -15.6
12 | Retinol binding protein, human | VAHU | 18 | 10 | 2 | 10 | 14 | 5.04 | -16.9
13 | Actin beta, rat | ATRTC | 23 | 26 | 9 | 19 | 18 | 5.06 | -17.2
14 | Actin gamma, rat | ATRTC | 20 | 29 | 9 | 19 | 18 | 5.07 | -16.6
15 | Apo A-I lipoprotein, human | LPHUA1 | 16 | 30 | 5 | 21 | 16 | 5.10 | -17.5
16 | Apo A-IV lipoprotein, human | LPHUA4 | 20 | 49 | 6 | 28 | 24 | 4.88 | -19.7
17 | Tubulin alpha, rat | UBRTA | 27 | 37 | 13 | 19 | 21 | 4.66 | -19.8
18 | F1 ATPase beta, bovine | PWBOB | 25 | 36 | 9 | 22 | 22 | 4.80 | -21.0
19 | Tubulin beta, pig | UBPGB | 26 | 36 | 10 | 15 | 22 | 4.49 | -22.[?]
20 | Protein disulphide isomerase (PDI), rat hepatic | ISRTSS | 43 | 51 | 11 | 51 | 9 | 4.07 | -25.0
21 | Cytochrome b5, rat | CBRT5 | 10 | 15 | 6 | 10 | 4 | 4.59 | -26.0
22 | Apo C-II lipoprotein, human | LPHUC2 | 4 | 7 | 0 | 6 | 1 | 4.44 | -30.5

Amino acid pI assumed in calculation: ASP 3.9, GLU 4.1, HIS 6.0, LYS 10.8, ARG 12.5
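
The computed pI column of Table 4 can be approximated from the residue counts alone. The following is a minimal Python sketch, not the authors' spreadsheet: it assumes a Henderson-Hasselbalch charge model using only the five side-chain values listed in Table 4, and it ignores chain termini, cysteine and tyrosine, so it will not reproduce the tabulated values exactly (for the CPK composition it lands somewhat below the tabulated 6.84). The function names and the bisection search are illustrative assumptions.

```python
# Side-chain values taken from Table 4; the charge model itself is an assumption.
PK_ACIDIC = {"ASP": 3.9, "GLU": 4.1}
PK_BASIC = {"HIS": 6.0, "LYS": 10.8, "ARG": 12.5}


def net_charge(counts, ph):
    """Net side-chain charge at a given pH (Henderson-Hasselbalch)."""
    positive = sum(counts[r] / (1.0 + 10.0 ** (ph - pk)) for r, pk in PK_BASIC.items())
    negative = sum(counts[r] / (1.0 + 10.0 ** (pk - ph)) for r, pk in PK_ACIDIC.items())
    return positive - negative


def estimate_pi(counts, lo=0.0, hi=14.0, tol=1e-4):
    """Bisect for the pH at which the net charge crosses zero."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if net_charge(counts, mid) > 0:
            lo = mid  # still positively charged: the pI lies at a higher pH
        else:
            hi = mid
    return (lo + hi) / 2.0


# Residue counts copied from Table 4 (rows 0 and 8).
cpk = {"ASP": 28, "GLU": 27, "HIS": 17, "LYS": 34, "ARG": 18}
human_albumin = {"ASP": 36, "GLU": 61, "HIS": 16, "LYS": 60, "ARG": 24}

if __name__ == "__main__":
    print(round(estimate_pi(cpk), 2), round(estimate_pi(human_albumin), 2))
```

Because the net charge falls monotonically with pH in this model, bisection is sufficient; a fuller treatment would add terminal and other ionisable groups.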
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2-D Database of rat liver proteins 1977 

An updated two-dimensional gel database of rat liver 
proteins useful in gene regulation and drug effect 
studies 

We have improved upon the reference two-dimensional (2-D) electrophoretic
map of rat liver proteins originally published in 1991 (N. L. Anderson et al.,
Electrophoresis 1991, 12, 907-930). A total of 53 proteins (102 spots) are now
identified, many by microsequencing. In most cases, spots cut from wet
Coomassie Blue stained 2-D gels were submitted to internal tryptic digestion [2],
and individual peptides, separated by high-performance liquid chromatography
(HPLC), were sequenced using a Perkin-Elmer 477A sequenator. Additional
spots were identified using specific antibodies.



Figure 1 shows the current annotated 2-D map of F344
rat liver, analyzed using the Iso-DALT system (20 X 25
cm gels) and BDH 4-8 carrier ampholytes. Both the
map itself and the master spot number system remain
the same as shown in the original publication. Table 1
lists the important features of each identification shown,
including the gel position, pI, and Mr for the most
abundant or most basic form of each protein. Using this
extended base of identified spots, a series of four
improved calibration functions has been derived for the
pI and SDS-Mr axes (the first two of which are shown in
Fig. 2A and B). Both forward and reverse functions are
derived, so that one can compute the physical properties
of a spot with a given gel location, or inversely compute
the gel position expected for a protein having given
physical properties:

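$$Y_{\text{RAT LIVER}} = f_{M_r\text{-RAT LIVER } Y}\,(M_{r,\,\text{SEQUENCE-DERIVED}}) \tag{1}$$
$$X_{\text{RAT LIVER}} = f_{\text{p}I\text{-RAT LIVER } X}\,(\text{p}I_{\text{SEQUENCE-DERIVED}}) \tag{2}$$
$$M_{r,\,\text{GEL-DERIVED}} = f_{\text{RAT LIVER } Y\text{-}M_r}\,(Y_{\text{RAT LIVER}}) \tag{3}$$
$$\text{p}I_{\text{GEL-DERIVED}} = f_{\text{RAT LIVER } X\text{-p}I}\,(X_{\text{RAT LIVER}}) \tag{4}$$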

A spreadsheet program (in Microsoft Excel) was developed
to facilitate flexible computation of pI's from
amino acid sequence data, and the results were entered
into a relational database (Microsoft Access). A table of
spot positions and sequence-derived pI's and Mr's was
fitted with a large series of analytic equations using
Tablecurve (Jandel Scientific), and the four conversion
Eqs. (1)-(4), relating computed pI and gel X coordinate,
or computed molecular weight and gel Y coordinate,
were selected, based on criteria of simplicity, goodness
of fit and favorable asymptotic behavior. Table 2 lists the
equations and coefficients. Application of Eqs. (3) and
(4) to a spot's X and Y coordinates, given in [1], produces
improved Mr estimates, and allows computation of pI

Correspondence: N. Leigh Anderson, Large Scale Biology Corporation,
9620 Medical Center Drive, Rockville, MD 20850-3338, USA (Tel:
+301-424-5989; Fax: +301-762-4892; email: leigh@lsbc.com)

Keywords: Two-dimensional polyacrylamide gel electrophoresis / Liver
/ Map / Identification / Calibration

© VCH Verlagsgesellschaft mbH, 69451 Weinheim, 1995



directly in pH units, instead of in terms of positions relative
to creatine phosphokinase (CPK) charge standards.
The inverse Eqs. (1) and (2) were used to compute the
gel positions of a series of pI and Mr tick marks. These
tick marks were plotted with SigmaPlot (Jandel),
together with fiducial marks locating several prominent
spots, and the resulting graphic was aligned over the synthetic
gel image (computed by Kepler from the master
gel pattern) using Freelance (Lotus Development). Maps
were printed as Postscript output from Freelance, either
in black and white (as shown here) or in color, where
label color indicates subcellular location (available from
the first author upon request). We have also used the rat
liver 2-D pattern as presented here to calibrate the patterns
of other samples. Using mixtures of rat liver and
mouse liver samples, for example, we made composite
2-D patterns that allow use of the rat pattern to standardize
both axes of the mouse pattern. This was accomplished
by deriving transformations relating the rat and
mouse X, and separately the rat and mouse Y, axes
(Table 2, lower half; Fig. 2C and D) based on a series of
spots that coelectrophorese in these closely related species.
These functions were then applied to derive equations
relating the mouse liver X and Y to pI and SDS-Mr
(Eqs. 5 and 6 below). The resulting standardized 2-D pattern
for B6C3F1 mouse liver is shown in Fig. 3:

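$$M_{r,\,\text{MOUSE LIVER}} = f_{\text{RAT LIVER } Y\text{-}M_r}\,(f_{\text{MOUSE LIVER } Y\text{-RAT LIVER } Y}\,(Y_{\text{MOUSE LIVER}})) \tag{5}$$
$$\text{p}I_{\text{MOUSE LIVER}} = f_{\text{RAT LIVER } X\text{-p}I}\,(f_{\text{MOUSE LIVER } X\text{-RAT LIVER } X}\,(X_{\text{MOUSE LIVER}})) \tag{6}$$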

A slightly more complex approach can be used to stand- 
ardize samples that have few or no spots co-electropho- 
resing with rat liver proteins. In this case, a 2-D gel is 
prepared with a mixture of the two samples, and four 
functions (forward and backward, each for X and Y) are 
derived relating each sample's own master pattern to the 
composite. The required functions are then applied in a 
nested fashion to yield the desired result (using rat 
plasma as an example): 

$$M_{r,\,\text{RAT PLASMA}} = f_{\text{RAT LIVER } Y\text{-}M_r}\,(f_{\text{RAT PLASMA+LIVER } Y\text{-RAT LIVER } Y}\,(f_{\text{RAT PLASMA } Y\text{-RAT PLASMA+LIVER } Y}\,(Y_{\text{RAT PLASMA}}))) \tag{7}$$
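
The nested application in Eqs. (5)-(7) is ordinary function composition. The sketch below shows the pattern in Python. The two linear mapping functions are hypothetical stand-ins invented for illustration (the real ones are the fitted curves of Table 2 and of the composite gel), and the rat liver Y-to-Mr coefficients are as read from Table 2, which is itself an assumption about a partly illegible table.

```python
from functools import reduce


def compose(*funcs):
    """Return a function computing f1(f2(...fn(x))) for funcs = (f1, f2, ..., fn)."""
    return lambda x: reduce(lambda acc, f: f(acc), reversed(funcs), x)


# Hypothetical stand-ins for the fitted plasma/composite mappings (illustration only).
def plasma_y_to_composite_y(y):
    return 1.02 * y + 5.0


def composite_y_to_liver_y(y):
    return 0.98 * y - 3.0


# Rat liver Y -> Mr, form y = a + b*x^c, coefficients as read from Table 2.
def liver_y_to_mr(y):
    return -8464.5809 + 19095881.0 * y ** -0.9086255


# Eq. (7): apply the innermost mapping first, the rat liver calibration last.
plasma_y_to_mr = compose(liver_y_to_mr, composite_y_to_liver_y, plasma_y_to_composite_y)

if __name__ == "__main__":
    print(round(plasma_y_to_mr(500.0)))
```

Keeping each mapping as a separate function means a new sample type only needs its own two mappings to the composite pattern; the rat liver calibration is reused unchanged.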
Figure 1. Master 2-D gel pattern of Fischer 344 rat liver proteins, annotated with 53 protein identifications and computed pI and Mr axes.
Tentative identifications are in italic type.



Table 1. Proteins identified in the 2-D pattern of F344 rat liver

MSN a) | Protein ID b) | Protein name | Identification comments | Gel X c) | Experimental pI d) | Gel Y c) | Experimental Mr d)
126 | HADO-HUMAN e) | 3-HA-3,4-DO: 3-hydroxyanthranilate-3,4-dioxygenase | Internal sequence | 871.95 | 5.36 | 921.35 | 30 207
137, 159, 288, 258 | DIDH_RAT | 3 HDD: 3-hydroxysteroid dihydrodiol reductase | Ab (T. M. Penning) and pure protein | 1857.52 | 6.51 | 822.52 | 34 406
173 | MUP_RAT | α2u globulin | Presence in liver microsome lumen, abundance in kidney, pI, Mr | 919.16 | 5.43 | 1313.81 | 19 549
38 | ACTB_HUMAN | Actin β | Analogy with other mammalian patterns (e.g. human) through coelectrophoresis | 763.40 | 5.19 | 693.64 | 41 586
68 | ACTG_HUMAN | Actin γ | Analogy with other mammalian patterns (e.g. human) through coelectrophoresis | 779.42 | 5.21 | 692.26 | 41 677
693 | AFAR_RAT | Aflatoxin B1 aldehyde reductase | Internal sequence | 1993.?2 | 6.72 | 818.60 | 34 593
28, 21, 33 | ALBU_RAT | Albumin | Coelectrophoresis with principal plasma protein | 1262.81 | 5.86 | 445.64 | 66 354
43 | DHAM_RAT | Aldehyde dehydrogenase | N-Terminal sequence and AAA | 1317.72 | 5.91 | 589.03 | 49 602
96 | ARG1_RAT | Arginase | Internal sequence | 1730.72 | 6.?4 | 756.02 | 37 819
117 | SUAR_RAT | Arylsulfotransferase | Internal sequence | 1547.96 | 6.14 | 849.08 | 33 186
1163, 1161, 1162, 20 | GR78_RAT | BIP (GRP-78) | Ab (F. Witzmann) | 665.33 | 5.01 | 397.?9 | 74 564
185 | CAH3_RAT | CA-III | Uncertain; by comparison with mouse | 1996.60 | 6.72 | 1017.02 | 26 887
123 | CALM_HUMAN | Calmodulin | Analogy with human cellular patterns through coelectrophoresis | 23.05 | 4.03 | 1433.25 | 17 419
3, 201, 48, 39, 22, 24 | CRTC_RAT | Calreticulin | Ab (Lance Pohl) | 310.59 | 4.34 | 433.80 | 68 206
1184, 1186, [illegible] 174, 118, 5, 167, 157 | CPSM_RAT | Carbamyl phosphate synthase | 2-D of pure protein; confirmed by N-terminal sequence and AAA | 1453.56 | 6.05 | 181.64 | 160 640
54, 61 | CATA_RAT | Catalase | Internal sequence | 2000.81 | 6.73 | 499.64 | 58 968
136 | COX2_RAT | COX-II | Ab (J. W. Taanman), confirmed by internal sequence | 452.57 | 4.61 | 1062.67 | 25 504
87 | CYB5_RAT | Cytochrome B5 | 2-D of pure protein; Ab; confirmed by AAA | 515.68 | 4.73 | 1370.55 | 18 493
41 | CK-RAT e) | Cytokeratin | Location in cytoskeletal fraction | 1165.12 | 5.75 | 569.09 | 51 448
29 | CK-RAT e) | Cytokeratin | Location in cytoskeletal fraction | 743.11 | 5.15 | 605.23 | 48 187
5, 11 | ENPL-RAT e) | Endoplasmin | Ab (F. Witzmann) | 567.73 | 4.83 | 263.37 | 112 194
60 | ENOA_RAT | Enolase A | Internal sequence and AAA | 1399.78 | 6.00 | 623.34 | 46 674
27 | ER60_RAT | ER-60 | N-Terminal sequence (R. M. Van Frank) | 1184.?0 | 5.77 | 523.51 | 56 169
17 | ATPB_RAT | F1 ATPase β | N-Terminal sequence and AAA | 629.06 | 4.95 | 588.83 | 49 620
196 | ATP7_RAT | F1 ATPase δ | Internal sequence | 1227.24 | 5.82 | 1184.65 | 22 310
79 | F16P_RAT | Fructose-1,6-bisphosphatase | Uncertain; by comparison with ID in Garrison and Wager (JBC 257:13135-13143) | 924.54 | 5.44 | 737.77 | 38 858
62, 78 | DHE3_RAT | Glutamate dehydrogenase | N-Terminal sequence and internal sequence | 1887.?9 | 6.35 [sic] | 566.92 | 51 655
125 | HAST-RAT e) | HAST-1: N-hydroxyarylamine sulfotransferase | Internal sequence | 1297.94 | 5.89 | 861.35 | 32 638
307 | HO1_RAT | Heme oxygenase 1 | Uncertain; available data from internal sequence | 1219.39 | 5.81 | 915.71 | 30 423
413, 1250, 933 | HMCS_RAT | HMG CoA synthase, cytosolic | Ab (J. Germershausen) | 1033.48 | 5.59 | 538.13 | 54 571
133, 144, 235 | HMCS_RAT | HMG CoA synthase, mitochondrial (frag) | Ab (J. Germershausen), N-terminal sequence (Steiner/Lottspeich) | 666.40 | 5.02 | 1019.42 | 26 811
8, 23, 1307 | HS7C_RAT | HSC-70 | Positional homology (with human, etc.) through coelectrophoresis | 811.87 | 5.27 | 425.76 | 69 521
15, 25, 110 | P60_RAT | HSP-60 | Ab (F. Witzmann); confirmed by N-terminal sequence and AAA | 845.09 | 5.32 | 520.03 | 56 561
971 | HS70-RAT e) | HSP-70 | Ab (F. Witzmann) | 976.11 | 5.51 | 437.14 | 67 674
1216, 1215, 90, [illegible] | [illegible] | HSP-90 | Ab (F. Witzmann) | 659.86 | 5.00 | 329 | 90 107
256 | INGI-HUMAN | Interferon-γ induced protein | Internal sequence | 993.85 | 5.34 [sic] | 1006.04 | 27 237
415, 734 | LAMB-RAT e) | [name illegible] | Positional homology with human through coelectrophoresis, nuclear location | 737.10 | 5.14 | 425.19 | 69 615
80 | LAMR-RAT e) | 'Laminin receptor' | Internal sequence | 534.02 | 4.77 | 697.62 | 41 327
227 | FABL_RAT | L-FABP (liver fatty acid binding protein) | Ab (N. M. Bass) | 1586.09 | 6.18 | 1483.43 | 16 [?]22
134 | MDHC_MOUSE | Malate dehydrogenase | Internal sequence | 1270.85 | 5.86 | 861.96 | 32 620
18, 35, 226 | GR75-RAT e) | Mitcon:[?]; grp75 | Positional homology with human through coelectrophoresis | 905.67 | 5.41 | 413.67 | 71 589
175, 251 | NCPR_RAT | NADPH P450 reductase | 2-D of pure protein | 824.69 | 5.29 | 393.21 | 75 366
1168, 1170, 1171 | PDI_RAT | PDI: Protein disulfide isomerase | N-Terminal sequence (R. M. van Frank), Ab | 564.30 | 4.83 | 528.47 | 55 618
47, 93 | ALBU_RAT | Pro-Albumin | Microsomal lumen location, pI, Mr relative to albumin | 1391.03 | 5.99 | 446.68 | 66 195
236 | APA1_RAT | Pro-APO A-I lipoprotein | Coelectrophoresis with plasma protein | 920.41 | 5.43 | 1137.51 | 23 467
320 | IPK1_BOVIN | Protein kinase C inhibitor 1 | Internal sequence; homology with bovine protein | 1480.01 | 6.08 | 1458.81 | 17 007
152 | PNPH_MOUSE | Purine nucleoside phosphorylase | Internal sequence | 1507.19 | 6.10 | 911.16 | 30 599
1179, 1180, 1181, 1182, 1183 | PYVC-RAT e) | Pyruvate carboxylase | Tentative; 2-D of pure protein (J. G. Henslee, JBC, 1979); reported in Biochim. Biophys. Acta 1022, 115-125 | 1485.10 | 6.08 | 223.32 | 131 589
55, 103 | SM30_RAT | SMP-30: Senescence marker protein-30 | Internal sequence | 721.71 | 5.11 | 830.10 | 34 051
135 | SODC_RAT | Superoxide dismutase | AAA; confirmed by internal sequence (R. M. Van Frank) | 1161.24 | 5.74 | 1388.68 | 18 173
172 | TPM-RAT e) | Tm: tropomyosin | Location in cytoskeleton, 2-D position relative to human, Ab | 476.24 | 4.66 | 957.86 | 28 865
277, 56 | TBA1_RAT | Tubulin α | Positional homology with human through coelectrophoresis, cytoskeletal location | 688.22 | 5.06 | 537.67 | 54 620
50, 1225 | TBB1_RAT | Tubulin β | Positional homology with human through coelectrophoresis, cytoskeletal location | 621.29 | 4.93 | 535.48 | 54 855
1224 | VIME_RAT | Vimentin | Positional homology with human through coelectrophoresis, cytoskeletal location | 673.00 | 5.03 | 539.30 | 54 426
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Table 1. continued 










MSN" Protein IDb) 




Identification comment* 


Gel JT 


Experimental Gel Y* 1 Experimental 
p/* 1 Af," 


104 BBPL.RAT 


?: not is sequence 

databases 
23 kDa morphine-binding 

protein 


Internal sequence 
Internal sequence 


1191.28 
773.31 


5.78 680.42 42 469 
5 JO 1182.41 22 363 


a) Master spot number (MSN) from [1]
b) SwissPROT identifier
c) Coordinates of the most basic or most abundant assigned spot on the F344 master gel pattern
d) pI and Mr of most basic or most abundant assigned spot, derived from the calibration functions included here
e) SwissPROT style proposed identifier

Abbreviations: AAA, amino acid analysis; Ab, antibody

Table 2. Equations and coefficients

Function | Equation | r2 | a | b | c | d | e
Rat gel Y = f(computed Mr) | y = a + b exp(-x/c) | 0.988181021 | 178.74803 | 1967.7892 | 32363.958 | |
Rat gel X = f(computed pI) | y = a + bx + cx/ln x + d/x + e/x^1.5 | 0.99247216 | -8685665.? | -904497.94 | 3856926.1 | 18276844 | -27154534
Computed Mr = f(rat gel Y) | y = a + bx^c | 0.9960177 | -8464.5809 | 19095881 | -0.9086255 | |
Computed pI = f(rat gel X) | y = a + bx + cx^2 + dx^2 ln x + ex^3 | 0.99176499 | 4.044686 | -0.00114238 | 0.0000323 | -0.00000455 | 0.00000000176



Mouse gel Y ~ flrat gel X) 

Mouse gel X • firai ge! X) 
Rat gel Y - fl mouse gel Y) 
Rat gel X » flmouse gel JT) 



y m. o + Ax + CX*- 5 + rfx" Iqx + 

"/lax 0.99951069 

> - a + Ax 3 Inx + car" + dx 3 0.99926349 

.y * a + Ax 2 lax + cr 2 - 5 + rfx 3 0.99950032 

>■ - fl + Ax + ex 7 lax + rfx 3 - 5 + ex 5 0.9992832 



11861.44 678.91666 

58.935923 0.00091353 

69.740526 0.00050772 

-198.07189 2.0899063 



-0.78964914 15673639 -6953.9592 

-0.000213688 0.00000159 

-0.000130392 0.00000116 

-0.000671191 0.000145189 -0.000000986 



[Figure 2 panel labels: (A) y = a + bx + cx/ln x + d/x + e/x^1.5; (B) y = a + b exp(-x/c)]

Figure 2. Plots showing fits of selected equations (continuous curves) to data on identified proteins (square symbols). (A) pI computed from
sequence data versus gel X position for identified spots in F344 rat liver; (B) Mr computed from sequence data versus gel Y position for identified
spots in F344 rat liver; (C) gel X position for spots in B6C3F1 mouse liver versus X position in F344 rat liver, for coelectrophoresing spots; (D)
gel Y position for spots in B6C3F1 mouse liver versus Y position in F344 rat liver, for coelectrophoresing spots. In each case, inverse equations
were also computed (Table 2).

Figure 3. 2-D gel pattern for B6C3F1 mouse liver, standardized using the F344 rat liver pattern identifications, according to the method
described in the text. Twenty-nine proteins are identified.
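
As a worked illustration of Eqs. (3) and (4), the Python sketch below evaluates the two reverse calibration functions with the coefficients of Table 2. The equation forms and coefficient values are as read from a partly illegible table and should be treated as assumptions; in particular the pI coefficients appear rounded, so the pI polynomial (whose terms largely cancel) is only good to a few hundredths of a pH unit. The function names are illustrative.

```python
import math

# Coefficients as read from Table 2 (assumed; some appear rounded in the source).
PI_COEFFS = (4.044686, -0.00114238, 0.0000323, -0.00000455, 0.00000000176)
MR_COEFFS = (-8464.5809, 19095881.0, -0.9086255)


def pi_from_gel_x(x):
    """Eq. (4), assumed form: y = a + b*x + c*x^2 + d*x^2*ln(x) + e*x^3."""
    a, b, c, d, e = PI_COEFFS
    return a + b * x + c * x**2 + d * x**2 * math.log(x) + e * x**3


def mr_from_gel_y(y):
    """Eq. (3), assumed form: y = a + b*x^c."""
    a, b, c = MR_COEFFS
    return a + b * y**c


if __name__ == "__main__":
    # Spot MSN 126 of Table 1: gel X = 871.95, gel Y = 921.35.
    print(round(pi_from_gel_x(871.95), 2), round(mr_from_gel_y(921.35)))
```

Evaluated at the gel coordinates of spot MSN 126 (871.95, 921.35), the two functions come close to the experimental pI and Mr listed for that spot in Table 1 (5.36 and 30 207), which is a useful check on the reading of the coefficients.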



$$\text{p}I_{\text{RAT PLASMA}} = f_{\text{RAT LIVER } X\text{-p}I}\,(f_{\text{RAT PLASMA+LIVER } X\text{-RAT LIVER } X}\,(f_{\text{RAT PLASMA } X\text{-RAT PLASMA+LIVER } X}\,(X_{\text{RAT PLASMA}}))) \tag{8}$$

This unified approach, in which one well-populated 2-D
pattern is used to standardize a family of other patterns,
has the additional advantage that the resulting pI and Mr
scales are directly compatible. Hence one can compare
the relative pI's of mouse and rat versions of a sequenced
protein in a consistent pI measurement system,
and select likely inter-species analogs based on positional
relationships on common scales. Adoption of
immobilized pH gradient (IPG) technology [4-7] will
result in substantial improvements in pI positional
reproducibility for standard 2-D maps such as those presented
here; however, we believe that our approach will
continue to be useful in establishing the empirical pH
gradient actually achieved by such gels under given
experimental conditions (temperature, urea concentration,
etc.), and in relating patterns run on different IPG
ranges and using different lots of IPG gels (between
which some variation will persist). Development of
rodent organ maps is a continuing effort in our laboratories
[8-10], and results in regular additions of identified
proteins. Those who wish to receive current rodent liver
maps, with color annotations, should send a stamped
self-addressed envelope to the first author.



We would like to thank the individuals who provided antibodies
mentioned in Table 1, and R. M. van Frank for unpublished
sequence data.
identify all cDNA species, and the approach does not easily allow a systematic
screening. Analysis of gene expression by the study of proteins present in a cell or
tissue presents a favorable alternative. This can be achieved by use of two-dimensional
(2-D) gel electrophoresis, quantitative computer image analysis, and protein identification
techniques to create 'reference maps' of all detectable proteins. Such reference
maps establish patterns of normal and abnormal gene expression in the organism, and
allow the examination of some post-translational protein modifications which are
functionally important for many proteins. It is possible to screen proteins systematically
from reference maps to establish their identities.

To define protein-based gene expression analysis, the concept of the 'proteome'
was recently proposed (Wilkins et al., 1995; Wasinger et al., 1995). A proteome is the
entire PROTein complement expressed by a genOME, or by a cell or tissue type. The
concept of the proteome has some differences from that of the genome, as while there
is only one definitive genome of an organism, the proteome is an entity which can
change under different conditions, and can be dissimilar in different tissues of a single
organism. A proteome nevertheless remains a direct product of a genome. Interestingly,
the number of proteins in a proteome can exceed the number of genes present,
as protein products expressed by alternative gene splicing or with different post-translational
modifications are observed as separate molecules on a 2-D gel. As an
extrapolation of the concept of the 'genome project', a 'proteome project' is research
which seeks to identify and characterise the proteins present in a cell or tissue and
define their patterns of expression.

Proteome projects present challenges of a similar magnitude to that of genome
projects. Technically, the 2-D gel electrophoresis must be reproducible and of high
resolution, allowing the separation and detection of the thousands of proteins in a cell.
Low copy number proteins should be detectable. There should be computer gel image
analysis systems that can qualitatively and quantitatively catalog the electrophoretically
separated proteins, to form reference maps. A range of rapid and reliable techniques
must be available for the identification and characterisation of proteins. As a consequence
of a proteome project, protein databases must be assembled that contain
reference information about proteins; such databases must be linked to genomic
databases and protein reference maps. Databases should be widely accessible and easy
to use.

Recently, there have been many changes in the techniques and resources available 
for the analysis of proteomes. It is the aim of this chapter to discuss the status of the 
areas outlined above, and to review briefly the progress of some current proteome 
projects. 

Two-dimensional electrophoresis of proteomes 

Two-dimensional (2-D) gel electrophoresis involves the separation of proteins by their
isoelectric point in the first dimension, then separation according to molecular weight
by sodium dodecyl sulfate electrophoresis in the second dimension. Since first
described (Klose, 1975; O'Farrell, 1975; Scheele, 1975), it has become the method of
choice for the separation of complex mixtures of proteins, albeit with many modifications
to the original techniques. 2-D electrophoresis forms the basis of proteome
projects through separating proteins by their size and charge (Hochstrasser et al.

2-D gel resolution and reproducibility

A primary challenge of separating complex mixtures of proteins by 2-D gel electrophoresis
has been to achieve high resolution and reproducibility. High resolution
ensures that a maximum of protein species are separated, and reproducibility is
vital to allow comparison of gels from day to day and between research sites. These
factors can be difficult to achieve.

Carrier ampholytes are a common means of isoelectric focusing for the first
dimension of 2-D electrophoresis. Gels are usually focused to equilibrium to separate
proteins in the pI range 4 to 8, and run in a non-equilibrium mode (NEPHGE) to
separate proteins of higher pI (7 to 11.5) (O'Farrell, 1975; O'Farrell, Goodman and
O'Farrell, 1977). Unfortunately, the use of carrier ampholytes in the isoelectric
focusing procedure is susceptible to 'cathodic drift', whereby pH gradients established
by prefocusing of ampholytes slowly change with time (Righetti and Drysdale, [year illegible]).
Carrier ampholyte pH gradients are also distorted by high salt concentration of
samples (Bjellqvist et al., 1982), and by high protein load (O'Farrell, 197[?]). A further
limitation is that isoelectric focusing gels, which are cast and subject to electrophoresis
in narrow glass tubes, need to be extruded by mechanical means before application
to the second dimension - a procedure that potentially distorts the gel. Nevertheless,
many of the above shortcomings can be avoided by loading small amounts of 14C or 35S
radiolabelled samples (Garrels, 1989; Neidhardt et al., 1989; Vandekerkhove et al.,
1990). High sensitivity detection is then achieved through use of fluorography or
phosphorimaging plates (Bonner and Laskey, 1974; Johnston, Pickett and Barker,
1990; Patterson and Latter, 1993). However, this approach is only practicable for
organisms or tissues that can be radiolabelled.

An alternative technique, which is becoming the method of choice for the first
dimension separation of proteins, involves isoelectric focusing in immobilized pH
gradient (IPG) gels (Bjellqvist et al., 1982; Gorg, Postel and Gunther, 1988; Righetti,
1990). Immobilized pH gradients are formed by the covalent coupling of the pH
gradient into an acrylamide matrix, creating a gradient that is completely stable with
time. IPG gels are usually poured onto a stiff backing film, which is mechanically
strong and provides easy gel handling (Ostergren, Eriksson and Bjellqvist, 1988). The
major advantages of IPG separations are that they do not suffer from cathodic drift,
they allow focusing of basic and very acidic proteins to equilibrium, pH gradients can
be precisely tailored (linear, stepwise, sigmoidal), and that separations over a very
narrow pH range are possible (0.05 pH units per cm) (Righetti, 1990; Bjellqvist et al.,
1982, 1993a; Sinha et al., 1990; Gorg et al., 1988; Gelfi et al., 1987; Gunther et al.,
1988). However, it is not currently possible to use IPG gels to separate very basic
proteins of isoelectric point greater than 10, although this is under development.
Narrow pH range separations are useful to address problems of protein co-migration
in complex samples, allowing 'zooming in' on regions of a gel (Figure 1). IPG gel
strips are now commercially available, which begin to address the problems of intra-
and inter-lab isoelectric focusing reproducibility.
There are two means of electrophoresis for the second dimension separation of
proteins: vertical slab gels and horizontal ultrathin gels (Gorg, Postel and Gunther,
1988). Both are usually SDS-containing gradient gels of approximately 11% to 15%
acrylamide, which separate proteins in the molecular mass range of 10 - 150 kD. A
stacking gel is not usually used with slab gels, but is necessary when using horizontal
gel setups (Gorg, Postel and Gunther, 1988). Comparisons have shown that there is
little or no difference in the reproducibility of electrophoresis using either approach
(Corbett et al., 1994a), but commercially available vertical or horizontal precast gels
will provide greater reproducibility for occasional users. For slab gel electrophoresis,
the use of piperazine diacrylyl as a gel crosslinker and the addition of thiosulfate in the
catalyst system has been shown to give better resolution and higher sensitivity
detection (Hochstrasser and Merril, 1988; Hochstrasser, Patchornik and Merril,
1988).

Figure 1. Two-dimensional gel electrophoresis allows 'zooming in' on areas of interest. [Text illegible] highlight
proteins common to each gel. (A) Wide pI range two-dimensional electrophoresis map of human plasma
proteins. First dimension separation was achieved using an immobilised pH gradient of [?].5 to 10.0 units.
The second dimension was SDS-PAGE. Actual gel size was [dimensions illegible], and proteins were [illegible].
(B) [Text illegible] plasma map. The first dimension used a narrow range immobilised pH gradient of
[?].2 to [illegible] units, and second dimension was SDS-PAGE. Micropreparative loading was used, and the gel
blotted to PVDF. Proteins were visualised with amido black. Actual blot size was [dimensions illegible].
Notwithstanding the advances described above, there is an increasing demand to
improve the reproducibility of 2-D electrophoresis to facilitate database construction
and proteome studies. Harrington et al. (1993) explain that if a gel resolves [number illegible]
protein spots, and there is 99.5% spot matching from gel to gel, this will produce
[number illegible] spot errors per gel. This amount of error, which might accumulate with
each gel to gel comparison used in database construction, could produce an unacceptable
degree of uncertainty in gel databases. To address these issues, partial automation of 2-D
gel separations has been undertaken (Nokihara, Morita and Kuriki, 199[?]; Harrington
et al., 199[?]). Although results are preliminary, [text illegible]
in one study was found to be threefold [text illegible] (Harrington et
al., 1993). It should be noted that small 2-D gels [text illegible] can be
almost completely automated (Brewer et al., 1986), although these are not generally
used for database studies.
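
The arithmetic behind the spot-matching argument can be sketched in a few lines of Python. The spot count used in the usage note is hypothetical, since the figure quoted from Harrington et al. is not legible in this copy; only the 99.5% matching rate comes from the text. The second function additionally assumes that mismatches in successive gel-to-gel comparisons are independent, which is a simplification introduced here for illustration.

```python
def expected_mismatches(n_spots, match_rate):
    """Expected number of mismatched spots in one gel-to-gel comparison."""
    return n_spots * (1.0 - match_rate)


def fraction_still_matched(match_rate, n_comparisons):
    """Fraction of spots still correctly matched after a chain of comparisons.

    Assumes independent errors in each comparison (illustrative simplification).
    """
    return match_rate ** n_comparisons


if __name__ == "__main__":
    # n_spots=2000 is a hypothetical gel, not a figure from the source.
    print(expected_mismatches(2000, 0.995))
    print(fraction_still_matched(0.995, 10))
```

With a hypothetical gel of 2000 spots, a 99.5% matching rate leaves about ten mismatches per comparison, and chaining ten comparisons leaves roughly 95% of spots correctly matched under the independence assumption.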
MICROPREPARATIVE 2-D GEL ELECTROPHORESIS

With the advent of affordable protein microcharacterisation techniques, including N-terminal
microsequencing, amino acid analysis, peptide mass fingerprinting, phosphate
analysis and monosaccharide compositional analysis, a new challenge for 2-D electrophoresis
has been to maintain high resolution and reproducibility but to provide
protein in sufficient quantities for chemical analysis (high nanogram to microgram
quantities of proteins per spot). This becomes difficult to achieve with very complex
samples such as whole bacterial cells, as the initial protein load is distributed among [number illegible]
to 4000 protein species. Two approaches are used for producing amounts of protein
that can be chemically characterised. The first method is to run multiple gels,
and pool the spots of interest, and subject them to concentration (Ji et al., 1994; Walsh
et al., 199[?]; Rasmussen et al., 1992). In this approach, the concentration step can
also act as a purification step to remove accumulated electrophoretic contaminants
such as glycine. A more elegant approach has been to exploit the high loading capacity
of IPG isoelectric focusing. The high loading capacity of immobilised pH gradients
was described early (Ek, Bjellqvist and Righetti, 1983), but has only recently been
applied to 2-D electrophoresis [text illegible].
[Text illegible] mg of protein can be applied to a single gel, yielding microgram quantities
of hundreds of protein species. A further benefit of this approach is that proteins of
low abundance, which may not be visualised by lower protein loads, are more likely
to be detected. The use of electrophoretic or chromatographic prefractionation techniques
(Hochstrasser et al., 1991a; Harrington et al., 199[?]) followed by the use
of narrow-range IPG separations (Bjellqvist et al., 1993b) presents a [text illegible] solution to
studies on proteins present in low abundance.

Methods of protein detection

There are many means of detecting proteins from 2-D gels. The method used will be
dictated by factors including protein load on gel (analytical or preparative), the
purpose of the gel (for protein quantitation or for blotting and chemical characterisation),
and the sensitivity required. The most common means of protein detection and
their applications are shown in Table 1. Most detection methods have drawbacks for
no protein stain is able consistently to detect proteins over a wide range of concentrations,
isoelectric points and amino acid compositions, and with a variety of
post-translational modifications (Goldberg et al., 1985; Li et al., 1989). Furthermore,
there are large differences in staining pattern when identical gels or blots are subjected
to different stains, including amido black, imidazole zinc, india ink, ponceau S,
colloidal gold, or coomassie blue (Tovey, Ford and Baldo, 1987; Ortiz et al., 1992).
The most common means of quantitating large numbers of proteins in a 2-D gel
involves the radiolabelling of protein samples prior to electrophoresis, and protein
quantitation based on fluorography and image analysis or liquid scintillation counting
(Garrels, 1989; Celis and Olsen, 1994). However, proteins which do not contain
methionine cannot be detected if only [35S]methionine is used for labelling. Amino
acid analysis of protein spots visualised by other techniques presents a likely means of
protein quantitation for the future.



BLOTTING OF PROTEINS TO MEMBRANES 



Electrophoretic blotting of proteins from two-dimensional polyacrylamide gels to
membranes presents many options for protein identification and microcharacterisation
which are not possible when proteins remain in gels. For example, when proteins are
blotted to polyvinylidene difluoride (PVDF) membranes, they can be identified by N-terminal
sequencing, amino acid analysis, or immunoblotting, or they may be subjected
to endoproteinase digestion, monosaccharide analysis, phosphate analysis, or direct
matrix-assisted laser desorption ionisation mass spectrometry (Matsudaira, 1987;
Wilkins et al., 1995; Jungblut et al., 1994; Sutton et al., 1995; Rasmussen et al., 1994;
Weitzhandler et al., 1993; Murthy and Iqbal, 1991; Eckerskorn et al., 199[?]). It is
possible to combine some of these procedures on a single protein spot on a PVDF
membrane (Packer et al., 1995; Wilkins et al., submitted; Weitzhandler et al., 1993).
This is useful when minimal amounts of protein are available for analysis. These
techniques will be explored in detail later in this review. Notwithstanding the above,
there are some disadvantages associated with blotting of proteins to membranes.
There is always loss of sample during blotting procedures (Eckerskorn and Lottspeich,
1993), and common protein detection methods are less sensitive or not applicable to
membranes (Table 1), presenting difficulties for the analysis of low abundance
proteins. Detailed discussion of the merits of available membranes and common
blotting techniques can be found elsewhere (Eckerskorn and Lottspeich, 1993; Strupat
et al., 1994; Patterson, 1994).



2-D gel analysis, documentation, and proteome databases 

Following protein electrophoresis and detection, detailed analysis of gel images is
undertaken with computer systems. For proteome projects, the aim of this analysis is
to catalogue all spots from the 2-D gel in a qualitative and if possible quantitative
manner, so as to define the number of proteins present and their levels of expression.
Reference gel images, constructed from one or more gels, form the basis of two-
dimensional gel databases. These databases also contain protein spot identities and [...]

GEL IMAGE ANALYSIS AND REFERENCE GELS

After 2-D electrophoresis and protein visualisation by [...], fluorography or
phosphorimaging, images of gels are digitized for [...] scanner, laser densitometer,
or charge-coupled device camera (Garrels, 1989; Celis et al., 1990a; Urwin and
Jackson, 1993) [...] 256 or more grey scales. Following this, gel images are subjected
to a series of manipulations to remove vertical and horizontal [...], to detect spot
positions and boundaries, and to calculate spot intensity [...]. A standard spot (SSP)
number, which contains vertical and horizontal positional information, is assigned to
each detected spot and becomes [...]. Some notable software packages [...] (Table 2).



Table 2: Some Software Packages for the Analysis of Gel Images

[Table body not recoverable; the only surviving entry is TYCHO & KEPLER.]
CALCULATION OF PROTEIN ISOELECTRIC POINT AND MOLECULAR WEIGHT

Estimation of the isoelectric point (pI) and molecular weight (MW) of proteins from
2-D gels provides fundamental parameters for each protein, which are also used
during identification procedures (see following section). The pI and MW of proteins
are recorded in 2-D gel databases. Accurate estimations of protein pI and MW can be
obtained by using 20 or more known proteins on a reference map to construct standard
curves of pI and molecular weight, which are then used to calculate estimated pI and MW
of unknown proteins (Neidhardt et al., 1989; Garrels and Franza, 1989; VanBogelen,
Hutton and Neidhardt, 1990; Anderson and Anderson, 1991; Anderson et al., 1991;
Latham et al., 1992). Alternatively, the MW of individual proteins blotted to PVDF
can be determined very accurately by direct mass spectrometry (Eckerskorn et al.,
1992). Where immobilised pH gradients are used, the focusing position of proteins
allows their pI to be measured within 0.15 units of that calculated from their amino
acid sequence (Bjellqvist et al., 1993c). It must be noted, however, that proteins
carrying post-translational modifications may migrate to unexpected pI or MW
positions during electrophoresis (Packer et al., 1995).
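
The standard-curve idea described above can be sketched in a few lines of Python. This is
only an illustration under stated assumptions: pI is taken to vary linearly with the
horizontal gel position, log10(MW) is taken to vary linearly with the vertical position,
and every coordinate and marker value below is hypothetical. Three markers are used for
brevity, whereas the text recommends 20 or more.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares straight line; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def build_estimator(markers):
    """markers: (x_mm, y_mm, known_pI, known_MW) tuples for known proteins."""
    xs = [m[0] for m in markers]
    ys = [m[1] for m in markers]
    # Assumed linear relationship between x position and pI.
    pi_slope, pi_icept = fit_line(xs, [m[2] for m in markers])
    # Assumed linear relationship between y position and log10(MW).
    mw_slope, mw_icept = fit_line(ys, [math.log10(m[3]) for m in markers])

    def estimate(x_mm, y_mm):
        est_pi = pi_slope * x_mm + pi_icept
        est_mw = 10 ** (mw_slope * y_mm + mw_icept)
        return est_pi, est_mw

    return estimate

# Hypothetical marker proteins (positions in mm, pI, MW in Da).
MARKERS = [
    (10.0, 20.0, 4.5, 100000.0),
    (60.0, 70.0, 5.5, 31622.7766),
    (110.0, 120.0, 6.5, 10000.0),
]

estimate = build_estimator(MARKERS)
est_pi, est_mw = estimate(85.0, 45.0)  # an unknown spot at x = 85 mm, y = 45 mm
print(round(est_pi, 2), round(est_mw))
```

In practice the residuals of the marker proteins about each curve give a rough idea of how
far an estimate for an unknown spot can be trusted.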

SPOT QUANTITATION AND EXPRESSION ANALYSIS

A major challenge faced in proteome projects is the quantitative analysis of proteins
separated by 2-D electrophoresis. The most accurate means of protein quantitation is
to determine chemically the amount of each protein present, by amino acid
compositional analysis. However, the current method of choice for quantitative analysis
of many proteins is to radiolabel samples with [35S]methionine or 14C amino acids,
perform the 2-D electrophoresis, and measure protein levels in disintegrations per
minute (dpm) or units of optical density. Quantitation is achieved either by liquid
scintillation counting, or by gel image analysis where spot densities are quantitated
by reference to gel calibration strips containing known amounts of radiolabelled
protein, or by the integrated optical density of a spot (Vandekerckhove et al., 1990;
Celis et al., 1990b; Celis and Olsen, 1994; Garrels, 1989; Latham, Garrels and Solter,
1993; Fey et al., 1994). All approaches effectively allow spots to be normalised
against the total disintegrations per minute loaded onto the gel. Limitations that
remain with radiolabelling methods are that absolute quantitation is not achieved,
because all proteins have varying amounts of any amino acid, and that only easily
labelled samples can be investigated. Quantitative silver staining presents an
alternative (Giometti et al., 1991; Harrington et al., 1992; Rodriguez et al., 1993;
Myrick et al., 1993), which when undertaken with [35S]thiourea (Wallace and Saluz,
1992a,b) is of extremely high sensitivity.

When protein spots from samples prepared under different conditions are quantitated
and matched from gel to gel, it becomes possible to examine changes and patterns in
protein expression. Large scale investigation of up- and down-regulation of proteins,
as well as their appearance and disappearance, can be undertaken. For example, simian
virus 40 transformed human keratinocytes were shown to have 177 up-regulated and 58
down-regulated proteins compared to normal keratinocytes (Celis and Olsen, 1994);
detailed synthesis profiles of 1200 proteins have been established in 1 to 4 cell mouse
embryos (Latham et al., 1991, 1992); and 4 proteins out of 1971 were found to be
markers of cadmium toxicity in urinary proteins (Myrick et al., 1993). Complex global
changes in protein expression as a result of gene disruptions have also been
investigated (S. Fey and P. Mose-Larsen, Personal communication). Impressively, large
gel sets showing protein expression under different conditions can be globally
investigated using statistical methods that find groups of related objects within a
set. For example, the REF52 rat cell line database, consisting of 79 gels from 12
experimental groups where each gel contains quantitative data for 1600 crossmatched
proteins, has been analysed by cluster analysis (Garrels et al., 1990). This revealed
clusters of proteins that, for example, were induced or repressed similarly under
simian virus 40 or adenovirus transformation, suggesting a common mechanism. Protein
groups that were induced or repressed during culture growth to confluence were also
found. It is obvious that the potential for investigation of cellular control
mechanisms by these approaches is immense. It is equally clear that investigations of
gene expression of this scale are currently technically impossible using nucleic-acid
based techniques.
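
As a rough illustration of the normalisation and gel-to-gel comparison described above,
the following Python sketch expresses each matched spot as a fraction of the total dpm
on its gel and then labels it as up-regulated, down-regulated, appeared or disappeared.
The spot labels, dpm values and the two-fold threshold are assumptions made for the
example; none of them come from the cited studies.

```python
def normalise(spot_dpm):
    """Express each spot as a fraction of the total dpm measured on its gel."""
    total = sum(spot_dpm.values())
    return {spot: dpm / total for spot, dpm in spot_dpm.items()}

def classify_regulation(control, treated, fold=2.0):
    """Compare matched spots between two gels after normalisation.

    `fold` is an assumed threshold: a spot is called 'up' or 'down' when its
    normalised level changes by at least this factor.
    """
    control_n = normalise(control)
    treated_n = normalise(treated)
    calls = {}
    for spot in sorted(set(control_n) | set(treated_n)):
        c = control_n.get(spot, 0.0)
        t = treated_n.get(spot, 0.0)
        if c == 0.0 and t == 0.0:
            calls[spot] = "unchanged"
        elif c == 0.0:
            calls[spot] = "appeared"
        elif t == 0.0:
            calls[spot] = "disappeared"
        elif t / c >= fold:
            calls[spot] = "up"
        elif t / c <= 1.0 / fold:
            calls[spot] = "down"
        else:
            calls[spot] = "unchanged"
    return calls

# Hypothetical dpm values for matched spots on two gels.
control = {"SSP0101": 500.0, "SSP0202": 300.0, "SSP0303": 200.0}
treated = {"SSP0101": 1000.0, "SSP0202": 600.0, "SSP0303": 4400.0, "SSP0404": 50.0}

calls = classify_regulation(control, treated)
for spot, call in calls.items():
    print(spot, call)
```

Note that SSP0101 doubles in raw dpm yet is called 'down', because the treated gel
carries far more total label; this is the reason for normalising before comparing gels.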
[Table 3 residue: the database-name column was lost in extraction, apart from the final
entry, "... and Yeast Electrophoretic Protein Database (YEPD)". The surviving entries of
the features column follow, one per database.]

Gel spots linked with GenBank and Kohara clones; quantitative spot measurements under
different growth conditions

Identification of disease markers; two separate databases have been established

Extensive identifications; quantitative spot measurements of transformed cells;
identification of disease markers

Quantitative spot measurements through 1 to 4 cell stage

Documents changes due to exposure to ionizing radiation and toxic chemicals

Detailed subcellular fractionation studies

Extensive studies on regulation of proteins by drugs and toxic agents

Accessible via World Wide Web; quantitative spot measurements under different conditions

Accessible via World Wide Web; completely integrated with SWISS-PROT and SWISS-3DIMAGE

Completely crossreferenced organism database. YPD has extensive information on over
[?]500 proteins; YEPD has many identifications
FEATURES OF PROTEOME DATABASES

Proteome projects rely heavily on computer databases to store information about all
proteins expressed by an organism. 'Proteome databases' should contain detailed
information on proteins already characterised elsewhere, as well as protein data from
2-D gels such as apparent pI and MW, expression level under different conditions,
subcellular localisation, and information on post-translational modifications. Images
of reference 2-D gels, showing protein SSP numbers and protein identifications,
should also be included. Ideally, proteome databases should be accessible with
Macintosh or IBM personal computers and easy to use. Some proteome databases and
the areas they cover are listed in Table 3. Databases range from collections of
annotated gels to large databases of images integrated with protein and nucleic-acid
sequence banks.
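
The kinds of information listed above can be pictured as a single record per protein
spot. The following Python sketch is purely illustrative: the class name, field names
and example values are hypothetical and do not describe the schema of any existing
database.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class ProteomeSpot:
    """One spot on a reference 2-D gel (hypothetical record layout)."""
    ssp_number: str                        # standard spot number on the reference gel
    apparent_pi: float                     # pI estimated from the gel
    apparent_mw: float                     # MW estimated from the gel, in Da
    identification: Optional[str] = None   # protein name, if identified
    subcellular_location: Optional[str] = None
    modifications: List[str] = field(default_factory=list)
    expression: Dict[str, float] = field(default_factory=dict)       # condition -> level
    cross_references: Dict[str, str] = field(default_factory=dict)   # database -> accession

    def is_identified(self) -> bool:
        return self.identification is not None

# Hypothetical example entry.
spot = ProteomeSpot(ssp_number="SSP0101", apparent_pi=5.9, apparent_mw=45000.0)
spot.expression["exponential growth"] = 0.012
spot.modifications.append("phosphorylation")
print(spot.is_identified())
spot.identification = "hypothetical protein A"
print(spot.is_identified())
```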

One example of an integrated proteome database is the suite of SWISS-PROT,
SWISS-2DPAGE and SWISS-3DIMAGE databases (Appel et al., 1993; Appel et al.,
1994; Appel, Bairoch and Hochstrasser, 1994; Bairoch and Boeckmann, 1994). The
features of these three databases are listed in Table 4. SWISS-PROT, SWISS-
2DPAGE and SWISS-3DIMAGE are accessible through the World Wide Web
(Berners-Lee et al., 1992), allowing any computer connected to the internet to access
the stored information and images. Navigation between the databases is seamless, as
all potential crosslinks are highlighted on display and can be selected with a computer
mouse. From [...], detailed information about a protein, including amino acid sequence
and known post-translational modifications, can be obtained; the precise protein spot
on a 2-D gel image can be viewed if known, [...] if available. References to nucleic
acid and other databases [...], to provide access to information stored elsewhere, are
also given.

Table 4: The SWISS-PROT, SWISS-2DPAGE and SWISS-3DIMAGE suite of crosslinked databases.
All three databases are accessible through the World Wide Web, at URL address: http://
expasy.hcuge.ch/

SWISS-PROT
  Information:      Text entries of sequence data; citation information; taxonomic
                    data. 38,[?] entries in Release 2[?]
  Annotations:      Protein function; post-translational modifications; domains;
                    secondary structure; quaternary structure; diseases associated
                    with protein; sequence conflicts
  Cross-referenced  SWISS-2DPAGE, SWISS-3DIMAGE, EMBL, PIR, PDB, OMIM, PROSITE,
  databases:        Medline, FlyBase, GCRDb, MaizeDB, WormPep, DictyDB
  Other features:   Navigation to other SWISS databases achieved by selecting
                    entries with computer mouse

SWISS-2DPAGE
  Information:      2-D gel images of human liver, plasma, HepG2, HepG2 secreted
                    proteins, red blood cell, lymphoma, cerebrospinal fluid,
                    macrophage like cell line, erythroleukemia cell, platelet
  Annotations:      Gel images where protein is found; how protein identified;
                    protein pI and MW; protein number; normal and pathological
                    variants
  Cross-referenced  SWISS-PROT and all other databases accessible through
  databases:        SWISS-PROT
  Other features:   Gel images show position of identified proteins, or region of
                    gel where protein should appear

SWISS-3DIMAGE
  Information:      Collection of [?] 3-D images of proteins
  Annotations:      All annotations available in SWISS-PROT
  Cross-referenced  SWISS-PROT and all other databases accessible through
  databases:        SWISS-PROT
  Other features:   Mono and stereo images available. Images can be transferred to
                    local computer image viewing programs

'Organism' databases, containing detailed [...] information [...] about a species, are
becoming [...] for proteome projects. These differ from nucleic acid or protein
sequence databases such as [...] SWISS-PROT because they are image [...] and contain
information about chromosomal map positions, transcription of genes and [...]
expression patterns. The Escherichia coli gene-protein database (VanBogelen and
Neidhardt, 199[?]; [...]), ECO2DBASE, is one example. It contains [...] 2-D gel [...]
information including pI and MW estimates, [...] identification [...], genetic
information (GenBank or EMBL codes, chromosomal [...] on Kohara clones (Kohara,
Akiyama and Isono, 1987)), [...] and protein regulatory information (level of protein
[...], member of regulon or stimulon). [...] referenced to the SWISS-PROT database
(Bairoch and Boeckmann, 1994). It is anticipated that organism databases will soon be
[...] available information about a [...]. [...] no consistent manner in which
organism databases are constructed, which may hamper comparisons in the future.

Identification and characterisation of proteins from 2-D gels

The number of proteins identified on a 2-D reference map [...] determines its
usefulness as a research and reference tool. As [...] proteins identified, a major aim
[...] from 2-D maps, in order to define them as [...] nucleic acid and protein
databases, or as unknown. Protein [...] of DNA open reading frames, and provides focus
for [...] projects and characterisation efforts by pointing to proteins that [...].
Since there can be [...] 4000 proteins from a single 2-D map, the challenge in mass
protein screening is to [...] a minimum of cost and effort.

Traditionally, proteins from 2-D gels [...] comigration of unknown proteins [...]. A
hierarchical approach to mass protein identification (Table 5) [...] alternative to
traditional approaches (Cordwell et al., 1995; Wasinger et al., 1995). This involves
the use of rapid and cheap identification tools such as amino acid analysis and
peptide mass fingerprinting as first steps in protein identification, followed by the
use of slower, more expensive and time consuming [...]. In the construction of this
hierarchy the [...] of the data created has been [...] machine time per sample, the
analytical [...] and time [...]. Amino acid analysis and peptide mass fingerprinting
[...] identification techniques in the hierarchy are discussed [...]. [...] protein
identification techniques in Table 5, see [...].

Table 5: Hierarchical analysis for mass screening of [...] proteins [...] blotted to
membranes. Rapid and inexpensive techniques are used as first steps [...]

Order  Identification technique                    References

1      Amino acid analysis                         Jungblut et al., 1992; Shaw, 1993;
                                                   Hobohm, Houthaeve and Sander, 1994
2      Amino acid analysis with N-terminal         Wilkins et al., submitted
       sequence tag
3      Peptide-mass fingerprinting                 Henzel et al., 1993; [...];
                                                   Mann, Hojrup and Roepstorff, 1993;
                                                   [...]; Sutton et al., 1995
4      Combination of amino acid analysis and      Cordwell et al., 1995;
       peptide mass fingerprinting                 Wasinger et al., 1995
5      Mass spectrometry sequence tag              Mann and Wilm, 1994
6      Extensive N-terminal Edman microsequencing  Matsudaira, 1987
7      Internal peptide Edman microsequencing      Rosenfeld et al., [...];
                                                   Hellman et al., 1995
8      Microsequencing by mass spectrometry        Johnson and Walsh, [...]
       (electrospray ionisation, post-source
       decay MALDI-TOF)
9      Ladder sequencing                           Bartlet-Jones et al., [...]
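
The hierarchical strategy summarised in Table 5 amounts to trying the cheapest technique
first and escalating only when the result is not convincing. A minimal Python sketch of
that control flow is given below. The technique functions, the confidence values and the
0.9 threshold are hypothetical placeholders; how confidence would actually be judged for
each technique is not specified here.

```python
def identify_hierarchically(sample, techniques, min_confidence=0.9):
    """Try techniques in order (cheapest first) until one gives a confident result.

    `techniques` is a list of (name, function) pairs. Each function takes the
    sample and returns (protein_id or None, confidence between 0 and 1).
    Returns (protein_id or None, list of technique names that were tried).
    """
    tried = []
    for name, technique in techniques:
        tried.append(name)
        protein_id, confidence = technique(sample)
        if protein_id is not None and confidence >= min_confidence:
            return protein_id, tried
    return None, tried

# Hypothetical stand-ins for real analyses, ordered as in Table 5.
def amino_acid_analysis(sample):
    return "PROT_A", 0.5       # plausible match, but not convincing on its own

def peptide_mass_fingerprinting(sample):
    return "PROT_A", 0.95      # confirms the identity

def edman_microsequencing(sample):
    return "PROT_A", 0.99      # slow and expensive; only reached if needed

HIERARCHY = [
    ("amino acid analysis", amino_acid_analysis),
    ("peptide mass fingerprinting", peptide_mass_fingerprinting),
    ("Edman microsequencing", edman_microsequencing),
]

protein, tried = identify_hierarchically("spot SSP0101", HIERARCHY)
print(protein, tried)
```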

PROTEIN IDENTIFICATION BY AMINO ACID COMPOSITION

There has been a revival of interest in the use of [...] for identification of
proteins from 2-D gels after [...] (Eckerskorn et al., 1988). This technique uses a
protein's idiosyncratic amino acid composition profile in order to identify it by
comparison with theoretical compositions [...]. The amino acid composition of proteins
can be [...] radiolabelling and quantitative [...] 2-D electrophoresis (Garrels et al.,
1994; Frey et al., 1994), or by acid hydrolysis [...] chromatographic analysis of [...]
amino acid mixture (Eckerskorn et al., 1988; Tous et al., 1989; Gharahdaghi et al.,
[...]; [...] 1995). As differential metabolic [...] film or phosphorimage plate
exposures of [...]

Amino acid composition (values as extracted; "?" marks an illegible digit):

  Asx: 15.2    Glx: 10.4    Ser:  5.7    His:  0.7
  Gly:  ?.4    Thr:  3.6    Ala:  6.7    Pro:  7.9
  Tyr:  1.3    Arg:  5.0    Val:  8.0    Met:  0.3
  Ile:  5.9    Leu:  6.0    Phe: 13.3    Lys:  4.4

  pI estimate: 6.89     Range searched: (6.64, 7.14)
  Mw estimate: 16800    Range searched: (13440, 20160)

Closest SWISS-PROT entries for the species ECOLI matched by AA composition:

  Rank  Score  Protein       pI     Mw     Description
  1      24    [illegible]   6.84   169?9  ASPARTATE CARBAMOYLTRANSFERASE [...]
  2      39    [illegible]   6.32   36359  PANTOTHENATE KINASE (EC 2.7.1.33)
  3      40    [illegible]   5.06   35713  HOMOSERINE O-SUCCINYLTRANSFERASE
  4      42    [illegible]   5.52   57812  TRANSCRIPTIONAL ACTIVATOR CADC
  5      43    [illegible]   ?.??   19759  HEMOLYSIN C, PLASMID

Closest SWISS-PROT entries for ECOLI with pI and Mw values in specified range:

  Rank  Score  Protein       pI     Mw     Description
  1      24    [illegible]   6.84   169?9  ASPARTATE CARBAMOYLTRANSFERASE [...]
  2     102    [illegible]   6.73   17921  TRAJ PROTEIN
  3     112    [illegible]   6.79   19028  HYPOTHETICAL LIPOPROTEIN YAJC
  4     140    [illegible]   6.83   14945  HYPOTHETICAL 14.9 KD PROTEIN IN GRPE [...]
  5     142    [illegible]   7.06   14726  HYPOTHETICAL PROTEIN IN BETT 3' REGION

Figure 4. Computer printout from ExPASy server where the empirical amino acid composition,
estimated pI and MW of a protein from a 2-D reference map of E. coli were matched against all
entries in SWISS-PROT for E. coli. The correct identification, aspartate carbamoyltransferase,
is shown in bold. Low scores indicate a good match. Note how matching within a defined pI and
MW range (lower set of proteins) has greatly increased the score difference between the first
and second ranking proteins. This score difference gives high confidence in the identification,
and is only observed where the top ranking protein is the correct identification (Wilkins et
al., 1995).

graphy-based analysis. Proteins blotted to PVDF membranes can be hydrolysed in 1 h
at 155°C, amino acids extracted in a single brief step, and each sample automatically
derivatised and separated by chromatography in under 40 minutes (Wilkins et al.,
1995; Ou et al., 1995). In this manner, one operator can routinely analyse 100 proteins
per week on one HPLC unit. This technology lends itself to automation, and it is
anticipated that instruments with even greater sample throughput will be developed.
When proteins have been prepared by micropreparative 2-D electrophoresis (Hanash
et al., 1991; Bjellqvist et al., 1993b), blotted to a PVDF membrane and stained with
amido black, any visible protein spot is of sufficient quantity for amino acid analysis
(Cordwell et al., 1995; Wasinger et al., 1995; Wilkins et al., 1995).

After the amino acid composition of a protein has been determined, computer
programs are used to match it against the calculated compositions of proteins in
databases (Eckerskorn et al., 1988; Sibbald, Sommerfeldt and Argos, 1991; Jungblut
et al., 1992; Shaw, 1993; Hobohm, Houthaeve and Sander, 1994; Wilkins et al.,
1995). Matching is usually done with only 15 or 16 amino acids, as cysteine and
tryptophan are destroyed during hydrolysis, asparagine and glutamine are deamidated
to their corresponding acids, and proline is not quantitated in some analysis systems.
The computer programs produce a list of best matching proteins, which are ranked by
a score that indicates the match quality. Some programs allow matching to be
restricted to specific 'windows' of MW and pI (Hobohm, Houthaeve and Sander,
1994; Wilkins et al., 1995), and to protein database entries for one species (Jungblut
et al., 1992; Wilkins et al., 1995). The use of such restrictions increases the power of
matching. An example of protein identification by amino acid composition is shown
in Figure 4. To date, amino acid composition has been used to identify proteins from
reference maps of Spiroplasma melliferum, Mycoplasma genitalium, E. coli,
Saccharomyces cerevisiae, Dictyostelium discoideum, human sera, human heart, human
lymphocyte, and mouse brain (Cordwell et al., 1995; Wasinger et al., 1995; Wilkins
et al., 1995; Jungblut et al., 1992, 1994; Garrels et al., 1994; Frey et al., 1994).

Amino acid composition (values as extracted; "?" marks an illegible digit):

  Asx:  ?.?    Glx: 1?.?    Ser:  4.1    His:  2.7
  Gly: 11.?    Thr:  2.6    Ala: 11.9    Pro:  3.2
  Tyr:  ?.?    Arg:  3.7    Val:  9.5    Met:  0.4
  Ile:  5.1    Leu:  8.2    Phe:  3.2    Lys:  4.9

  pI estimate: 5.99     Range searched: (5.74, 6.24)
  Mw estimate: 45000    Range searched: (34000, 54000)

Closest SWISS-PROT entries for ECOLI with pI and Mw values in specified range
(protein names and N-terminal sequences as extracted; several are misread or lost):

  Rank  Score  Protein       pI     Mw     N-terminal Seq.
  1      21    [illegible]   ?.03   45314  [illegible]
  2      32    [illegible]   5.86   36502  K S M I X
  3      38    [illegible]   5.78   45774  K S N S X
  4      44    YIHS_ECOLI    5.86   48011  K H U Y
  5      45    DHE4_ECOLI    5.98   48581  K 0 0 T Y
  6      46    [illegible]   5.79   43765  K A I E 0
  7      46    [illegible]   5.78   37851  [?] N H S I
  8      4?    [illegible]   5.98   49162  U N K A
  9      ??    [illegible]   5.85   43290  M S 5 X L
  10     50    [illegible]   6.01   37064  K E 5 K I

Figure 5. A PVDF protein spot from an E. coli 2-D reference map was sequenced for 4 cycles and
the same sample then subjected to amino acid analysis. The N-terminal sequence was M L K R.
When the [...] PROT for E. coli, the above list of best matches was produced. N-terminal
sequences are from SWISS-PROT for those entries. The top ranking identification of serine
hydroxymethyltransferase (bold) did not show a large score difference between the first and
second ranking proteins, giving little confidence in this being the correct protein
identification. However, the sequence tag (M L K R) confirmed the identity of the protein as
serine hydroxymethyltransferase.
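
The matching procedure described above can be sketched as follows. This Python example
is a simplified illustration, not the scoring used by any particular program: the score
is assumed to be the sum of squared differences between percentage compositions (so that
low scores indicate good matches), the default windows of 0.25 pI units and 20% of MW
are inferred from the ranges printed in Figure 4, and the database entries are
hypothetical and reduced to four residue groups for brevity.

```python
def composition_score(query, entry):
    """Sum of squared differences in percentage composition (assumed metric)."""
    return sum((query[aa] - entry[aa]) ** 2 for aa in query)

def rank_matches(query, database, pi=None, mw=None, pi_window=0.25, mw_fraction=0.20):
    """Rank database entries by composition score, lowest (best) first.

    If `pi` and/or `mw` estimates are given, only entries inside the
    corresponding window are considered.
    """
    ranked = []
    for name, record in database.items():
        if pi is not None and abs(record["pI"] - pi) > pi_window:
            continue
        if mw is not None and abs(record["MW"] - mw) > mw_fraction * mw:
            continue
        ranked.append((composition_score(query, record["composition"]), name))
    ranked.sort()
    return ranked

# Hypothetical query and database entries.
QUERY = {"Asx": 15.0, "Glx": 10.0, "Ser": 6.0, "Gly": 9.0}
DATABASE = {
    "PROT_A": {"pI": 6.8, "MW": 17000,
               "composition": {"Asx": 14.0, "Glx": 11.0, "Ser": 6.0, "Gly": 9.0}},
    "PROT_B": {"pI": 6.3, "MW": 36000,
               "composition": {"Asx": 15.0, "Glx": 10.0, "Ser": 7.0, "Gly": 9.0}},
    "PROT_C": {"pI": 7.0, "MW": 18000,
               "composition": {"Asx": 10.0, "Glx": 12.0, "Ser": 8.0, "Gly": 9.0}},
}

unrestricted = rank_matches(QUERY, DATABASE)
restricted = rank_matches(QUERY, DATABASE, pi=6.89, mw=16800)
print(unrestricted)
print(restricted)
```

In this toy data set the unrestricted search ranks PROT_B first by a narrow margin, while
the pI and MW windows remove it and leave a wide score gap between the first and second
candidates, which mirrors the effect noted in the caption of Figure 4.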

PROTEIN IDENTIFICATION BY AMINO ACID COMPOSITION AND N-TERMINAL
SEQUENCE TAG

When samples from 2-D gels are not unambiguously identified by amino acid [...]
(Wilkins et al., 1995). Taking advantage of [...] the mass spectrometry 'sequence tag'
concept (Mann and Wilm, 1994), we have combined Edman degradation and amino acid
analysis [...] protein identification (Wilkins et al., submitted). This involves the
N-terminal sequencing of PVDF-blotted proteins by Edman degradation for 3 or 4 cycles
[...], following which the same sample is used for amino acid analysis. As only a few
amino acids are removed from the protein, its composition is not significantly
altered. Furthermore, since only a small amount of protein sequence is required, [...]
Edman degradation cycles can be [...] should allow 3 cycles to be completed in 1 h,
thereby [...] screening of 100 or more proteins per week on one automated [...]. [...]
composition, pI and MW of proteins are matched against [...], and N-terminal sequences
of best matching proteins [...] the 'tag' to confirm the protein identity (Figure 5).
[...] proteins are N-terminally blocked, but as only a few [...] susceptible to the
acetyl, formyl, or pyroglutamyl [...] blocking [...] may itself provide useful
information for [...] identification. A strength of N-terminal sequence tag and amino
acid composition protein identification is that data generated are quickly and easily
interpreted.
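
The confirmation step can be sketched as a simple filter over the ranked list from
composition matching: the best-ranked candidate whose database N-terminus begins with the
experimentally determined tag is accepted. In the Python sketch below, the tag M L K R is
the one reported in Figure 5, while the candidate names and their N-terminal sequences are
hypothetical. The sketch also ignores complications such as removal of the initiator
methionine or signal peptides, which would shift the mature N-terminus.

```python
def confirm_with_tag(ranked_candidates, n_terminal_db, tag):
    """Return the best-ranked candidate whose N-terminus starts with `tag`.

    `ranked_candidates` is ordered best match first; `n_terminal_db` maps a
    candidate name to the N-terminal sequence stored in the database.
    Returns None when no candidate agrees with the tag.
    """
    for name in ranked_candidates:
        if n_terminal_db.get(name, "").startswith(tag):
            return name
    return None

# Hypothetical ranked list from composition matching, best score first.
RANKED = ["PROT_X", "PROT_Y", "PROT_Z"]

# Hypothetical database N-termini for those entries.
N_TERMINI = {
    "PROT_X": "MSNIQ",
    "PROT_Y": "MLKRE",
    "PROT_Z": "MDQTY",
}

confirmed = confirm_with_tag(RANKED, N_TERMINI, "MLKR")
print(confirmed)
```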

PROTEIN IDENTIFICATION BY PEPTIDE MASS FINGERPRINTING

Techniques for the identification of proteins by peptide mass fingerprinting have
recently been described (Henzel et al., 1993; Pappin, Hojrup and Bleasby, 1993; [...];
Mann, Hojrup and Roepstorff, 1993; [...]; Sutton et al., 1995). This involves the
generation of peptides from proteins using residue-specific enzymes, the determination
of peptide masses, and the matching of these masses against theoretical [...] protein
sequence databases. As proteins have different amino acid sequences, their peptides
should produce characteristic 'fingerprints'.

Proteins [...] digested [...], although [...] digests are reported to produce more
[...] subsequent [...] ([...] et al., 199[?]; Mortz et al., 1994). The enzyme of
choice for [...] is modified sequencing grade trypsin [...], although [...] proteases
have [...] (Pappin, Hojrup and Bleasby, 1993). To maximise the number of peptides
obtained, it is desirable for proteins [...]. This ensures that disulfide bonds of the
protein are broken, and produces [...] that are more amenable to digestion.
Surprisingly, [...] methods such as cyanogen bromide (methionine specific) [...]
([...] et al., 199[?]). Furthermore, if protein alkylation and reduction has not been
undertaken prior to protein digestion, peptide sequence coverage may [...] in the
protein (Mortz et al., 1994). For eukaryotes, a [...] is the presence of
post-translational modifications [...] the mass of the unmodified peptide alone can be
[...]. Two artifactual modifications introduced by electrophoresis, an acrylamide
adduct [...] and the oxidation of methionine, are also [...] (le Maire et al., 1993;
Hess et al., 1993). Peptide masses [...]

Table 6: [caption largely illegible] Post-translational modifications of peptides and
proteins and the associated mass change

Modification                                       Mass change (Da)

Acetylation                                         +42.04
Carboxylation of Asp or Glu                         +44.01
Deamidation of Asn or Gln                           +0.98
Disulfide bond formation                            -2.02
Formylation                                         +28.01
Hexosamines (GlcN, GalN)                            +161.16
Hexoses (Glc, Gal, Man)                             +162.14
Hydroxylation                                       +16.00
N-acetylhexosamines (GlcNAc, GalNAc)                +203.19
Oxidation of Met                                    +16.00
Phosphorylation                                     +79.98
Pyroglutamic acid formed from Gln                   -17.03
Sialic acid (NeuNAc)                                +291.26
Sulfation                                           +80.06

Table modified from Finnigan MAT [...]


A number of computer programs are available for matching peptide masses against
databases (reviewed in Cottrell, 1994). Matching is usually undertaken in an
interactive manner, whereby peaks of mass 500-5000 Da are selected and matched under
various search parameters including MW of protein, mass accuracy of peptides and
number of missed enzyme cleavages allowed (Henzel et al., 1993; Mortz et al., 1994;
Rasmussen et al., 1994). The correct protein identity is the protein which has the most
peptide masses in common with the unknown sample. Identities have been established
with as few as three peptides, but unambiguous identification is thought to require a
mass spectrometric map covering most peptides of the protein (Mortz et al., 1994;
Yates et al., 1993). To date, peptide mass fingerprinting of proteins has been
undertaken from the human myocardial protein and keratinocyte maps, from an E. coli
2-D gel, and from reference maps of Spiroplasma melliferum and Mycoplasma
genitalium (Sutton et al., 1995; Rasmussen et al., 1994; Henzel et al., 1993; Cordwell
et al., 1995; Wasinger et al., 1995), although the technique is most powerful when
used in combination with another protein identification technique (Rasmussen et al.,
1994; Cordwell et al., 1995).
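
The matching step described above can be illustrated with a small Python sketch. It
performs an in-silico tryptic digest (cleavage after K or R, but not before P), computes
monoisotopic peptide masses, keeps those between 500 and 5000 Da, and ranks database
proteins by the number of observed masses they explain. The protein sequences, the 0.5 Da
tolerance and the function names are assumptions made for the example; real programs
offer many more search parameters.

```python
# Monoisotopic residue masses in Da.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056

def tryptic_peptides(sequence, missed_cleavages=0):
    """Cleave after K or R (not before P), allowing up to N missed cleavages."""
    cuts = [0]
    for i, residue in enumerate(sequence[:-1]):
        if residue in "KR" and sequence[i + 1] != "P":
            cuts.append(i + 1)
    cuts.append(len(sequence))
    peptides = []
    for start in range(len(cuts) - 1):
        for extra in range(missed_cleavages + 1):
            end = start + 1 + extra
            if end < len(cuts):
                peptides.append(sequence[cuts[start]:cuts[end]])
    return peptides

def peptide_mass(peptide):
    """Monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER

def count_matches(observed, sequence, tolerance=0.5, missed_cleavages=1):
    """Number of observed masses explained by the protein's theoretical peptides."""
    theoretical = [peptide_mass(p) for p in tryptic_peptides(sequence, missed_cleavages)]
    theoretical = [m for m in theoretical if 500.0 <= m <= 5000.0]
    return sum(1 for obs in observed
               if any(abs(obs - theo) <= tolerance for theo in theoretical))

def rank_proteins(observed, database):
    """Rank proteins by number of matching peptide masses, most matches first."""
    scores = [(name, count_matches(observed, seq)) for name, seq in database.items()]
    return sorted(scores, key=lambda item: (-item[1], item[0]))

# Hypothetical protein sequences.
DATABASE = {
    "PROT_A": "MAGTLLSEEKVDWNAAFPHRGSTYILEVK",
    "PROT_B": "MSSPQELLDKGGAVTFWNRHHYEEALIK",
}

# Simulate an experiment: the 'observed' masses are PROT_A's complete-digest peptides.
observed = [peptide_mass(p) for p in tryptic_peptides(DATABASE["PROT_A"])]
ranking = rank_proteins(observed, DATABASE)
print(ranking)
```

The protein with the most peptide masses in common with the sample is ranked first, in
line with the criterion stated in the text. Modified peptides would not be matched by
this sketch, since only unmodified masses are calculated.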
MASS SPECTROMETRY SEQUENCE TAGGING

An extension of peptide mass fingerprinting has recently been described, called
peptide sequence tagging (Mann and Wilm, 1994; Mann, 1995). This uses tandem
mass spectrometry (MS/MS) to initially determine the mass of peptides, then subject
them to fragmentation by collision with a gas, and finally determine the mass of
fragments. The resulting spectra give information about a peptide's amino acid
sequence. The fragmentation masses of peptides can rarely be used to assign a complete
sequence, but they usually allow a short 'sequence tag' of 2 or 3 amino acids to be
determined. This sequence tag and the original peptide mass are matched by computer
against a database, providing a likely identity of the peptide and the protein it came
from. The major drawback of this technique as a mass screening tool is the complexity
of the mass data generated and the high level of expertise required for its
interpretation. Nevertheless, it represents a useful new protein identification method
which greatly increases the power of peptide mass fingerprinting protein identification.

Cross-species protein identification

Protein sequence databases continue to grow at a rapid rate, yet it is not widely
appreciated that close to 90% of all information contained in current protein databases
comes from only 10 species (A. Bairoch, Pers. Comm.). Fortunately, this information
can be used to study proteomes of organisms that are poorly defined at the molecular
level, via 2-D electrophoresis and 'cross-species' protein identification (Cordwell et
al., 1995; Wasinger et al., 1995). This approach allows proteins from reference maps
of many different species to be identified without the need for the corresponding genes
to be cloned and sequenced. This is particularly true for 'housekeeping proteins' such
as enzymes involved in glycolysis, DNA manipulation and protein manufacture,
which are highly conserved across species boundaries. Proteins that cannot be
identified across species boundaries can then become the focus of further protein
characterisation and DNA sequencing efforts.

Rapid cross-species identification of proteins from 2-D reference [...]
(Figure 6), but these techniques alone may not identify proteins unambiguously if
phylogenetic cross-species distances are great or analysis data is of poor quality
(Yates et al., 1993; Shaw, 1993; Cordwell et al., 1995). However, very high confidence
protein identities can be achieved when lists of best-matching proteins generated by
both techniques are compared (Cordwell et al., 1995; Wasinger et al., 1995). The
correct identification is found when the same protein is ranked highly in lists of best
matches generated by both techniques. This method has allowed approximately [?]
proteins from the reference map of the mollicute Spiroplasma melliferum, representing
approximately one quarter of the proteome, to be confidently identified by
reference to protein information from other species (S. Cordwell, Personal
Communication). When cross-species protein identification is to be undertaken, it
should be noted that the molecular weight of a protein type across species is usually
highly conserved, but that protein pI can vary by more than 2 units (Cordwell et al.,
1995). Accurate molecular weight determination by direct mass spectrometry of proteins
blotted to PVDF (Eckerskorn et al., 1992) should therefore be a useful
parameter for cross-species protein identification.
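
Comparing the two lists of best matches can be sketched as a rank intersection. In the
Python example below, a protein is retained only if it appears near the top of both the
amino acid composition list and the peptide mass fingerprinting list, and retained
proteins are ordered by their summed rank. The cut-off of five entries, the summed-rank
ordering and the protein names are assumptions made for illustration; the text does not
specify how "ranked highly" is decided.

```python
def consensus_candidates(aa_ranking, pmf_ranking, top_n=5):
    """Proteins found in the top `top_n` of both lists, best summed rank first."""
    aa_top = {name: rank for rank, name in enumerate(aa_ranking[:top_n], start=1)}
    pmf_top = {name: rank for rank, name in enumerate(pmf_ranking[:top_n], start=1)}
    shared = set(aa_top) & set(pmf_top)
    return sorted(shared, key=lambda name: (aa_top[name] + pmf_top[name], name))

# Hypothetical best-match lists from the two techniques, best match first.
AA_RANKING = ["PROT_Q", "PROT_A", "PROT_R", "PROT_S", "PROT_T", "PROT_B"]
PMF_RANKING = ["PROT_A", "PROT_U", "PROT_V", "PROT_T", "PROT_W", "PROT_Q"]

consensus = consensus_candidates(AA_RANKING, PMF_RANKING)
print(consensus)
```

Here PROT_Q tops the composition list but falls outside the top five by peptide mass, so
it is not retained, whereas PROT_A is ranked highly by both techniques and comes first.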

CHARACTERISATION OF POST-TRANSLATIONAL MODIFICATIONS

Many proteins are modified after translation. Such post-translational modifications,
including glycosylation, phosphorylation, and sulfation (see Table 6), are usually
necessary for protein function or stability. Some abnormal modifications are
associated with disease (Duthel and Revol, 1993; Ghosh et al., 1993; Yamashita et al.,
199[?]). In proteome studies, post-translational modifications can be examined on all
proteins present, or on individual spots. Studies on all proteins provide an indication
of which proteins may carry a certain type of modification. For example, 2-D gel
analysis of cell cultures grown in the presence of [3H]mannose or [32P]phosphate
gives an indication of which proteins carry glycans containing mannose, and which
proteins are phosphorylated (Garrels and Franza, 1989). Lectin binding studies of 2-D
gels blotted to PVDF or nitrocellulose provide information on the saccharides [...]
that are carried by proteins present (Gravel et al., 1994).

When individual proteins of interest carrying post-translational modifications have
been found, micropreparative 2-D electrophoresis can be used to purify them in
microgram quantities (Hanash et al., 1991; Bjellqvist et al., 1993b). If protein
isoforms of similar MW and pI are to be studied, focusing with narrow range pH
gradients (1 pH unit) can provide greater separation and resolution. After
electrophoresis, the type and degree of protein phosphorylation can be investigated
(Murthy and Iqbal, 1991; Gold et al., 1994), monosaccharide composition can be
determined [...], and [...] glycoamino acids can be investigated by either Edman
degradation based techniques or by mass spectrometry (Pisano et al., 1993; Huberty et
al., 1993; Carr, Huddleston and Bean, 1993). With further development of rapid
techniques, investigation of phosphorylation and monosaccharides by chromatographic or
mass spectrometric [...] become a routine step in characterisation of
post-translational modifications of proteins from reference maps.
The status of proteome projects



Many technical aspects of proteome research have [...] discussed in this
[...]. Advances in proteome projects [...]
to enable an identity, amino acid sequence [...]
each protein spot. Table 7 shows [...]
already defined for a number of organisms, and indicates that whilst
genome sequencing programs for [...] advanced, the massive
size of some other genomes [...] that their
complete nucleotide sequences are unlikely to be available [...].
Thus, 2-D reference maps and proteome projects of organisms like Mycoplasma
sp., E. coli and S. cerevisiae will be the [...] ([...] et al., 1995;
Wasinger et al., 1995; VanBogelen [...]). [...]
maps of other organisms will take longer to complete. However, the use of cross-species
protein identification techniques will [...]
and simple eukaryotes to be partially defined in [...].

[Table 7 caption: text illegible; it concerns 2-D reference maps of model organisms and their genome sizes.]
FUTURE DIRECTIONS OF PROTEOME PROJECTS

This review has described recent advances in the area of proteome research. It has
illustrated how new developments of older techniques (2-D electrophoresis and amino
acid analysis) as well as the applications of new technology (mass spectrometry) have
greatly widened the choice of tools the biologist and protein chemist has for the
separation, identification and analysis of complex mixtures of proteins. This has made
possible the establishment of detailed reference maps for organisms, which are
becoming the method of choice for the definition of tissues or whole cells, and the
investigation of gene expression therein.

Proteome projects are already impacting on the dogma of molecular biology that
DNA sequence constitutes the definition of an organism. For example, the proteomes
of different tissues of a single organism are often significantly different. Similarly,
cross-species identification of proteins (for example the identification of proteins
from Candida albicans by comparison with S. cerevisiae) can open up studies on
organisms that are poorly molecularly defined. As cross-species identification can
proceed at a pace orders of magnitude faster than a genome project in terms of
defining the gene and protein complement of organisms, the need for the DNA
sequencing of genomes will be avoided, and emphasis placed on those found to be
novel.

Just as genome sequencing is not an end in itself, neither is an annotated 2-D protein
reference map of an organism, nor indeed the identification of proteins in a proteome.
So whilst an immediate aim of proteome projects is to screen proteins in reference
maps, this will lead to expression studies and characterisation of post-translational
modifications. The challenge that then needs to be addressed is the investigation of
structure and function of proteins in a proteome. The magnitude of this is illustrated by
the fact that over half the open reading frames identified in S. cerevisiae chromosome
III were initially of no known function (Oliver et al., 1992). Structural and functional
studies will be an undertaking just as formidable as genome studies are now and
proteome projects are becoming, but will lead to an unimaginably detailed understanding
of how living organisms are constructed and how they operate.
Human cellular protein patterns and their link to genome DNA
sequence data: usefulness of two-dimensional gel 
electrophoresis and microsequencing 
ABSTRACT Analysis of cellular protein patterns by
computer-aided 2-dimensional gel electrophoresis together
with recent advances in protein sequence analysis have 
made possible the establishment of comprehensive 
2-dimensional gel protein databases that may link pro- 
tein and DNA information and that offer a global ap- 
proach to the study of the cell. Using the integrated ap- 
proach offered by 2-dimensional gel protein databases it 
is now possible to reveal phenotype specific protein (or 
proteins), to microsequence them, to search for homology 
with previously identified proteins, to clone the cDNAs, 
to assign partial protein sequence to genes for which the 
full DNA sequence and the chromosome location is 
known, and to study the regulatory properties and func- 
tion of groups of proteins that are coordinately expressed 
in a given biological process. Human 2-dimensional gel 
protein databases are becoming increasingly important in 
view of the concerted effort to map and sequence the en- 
tire genome. Celis, J. E.; Rasmussen, H. H.; Leffers, 

H.; Madsen, P.; Honore, B.; Gesser, B.; Dejgaard, K.;
Vandekerckhove, J. Human cellular protein patterns and
their link to genome DNA sequence data: usefulness of
two-dimensional gel electrophoresis and microsequencing.
FASEB J. 5: 2200-2208; 1991. 

Key Words: human protein patterns • 2-dimensional gel protein
databases • gene expression • microsequencing • cDNA cloning
• linking protein and DNA information • genome mapping and sequencing



Proteins synthesized from information contained in the 
DNA orchestrate most cellular functions. The total number
of proteins synthesized by a typical human cell is unknown 
although current estimates range from 3000 to 6000. Of 
these, as many as 70% may perform household functions 
and are expected to be shared by all cell types irrespective of 
their origin. There are many different cell types in the hu- 
man body with perhaps 30,000 to 50,000 proteins expressed 
in the organism as a whole judged from the fact that about 
3% of the haploid genome corresponds to genes. Today only
a small fraction of the total set of proteins has been identified, 
and little is known about the protein patterns of individual 
cell types or their variation under physiological and abnor- 
mal conditions. 

For the past 15 years, high resolution 2-dimensional gel 
electrophoresis has been the technique of choice to deter- 
mine the protein composition of a given cell type and for 
monitoring changes in gene activity through quantitative 
and qualitative analysis of the thousands of proteins that
orchestrate various cellular functions (refs 1-6 and references
therein). The technique, originally described by O'Farrell (1),
separates proteins in terms of their isoelectric point (pI) and
molecular weight. Usually one chooses a condition of interest
and the cell reveals the global protein behavioral
response, as all detected proteins can be analyzed both
qualitatively and quantitatively in relation to each other. At
present, most available 2-dimensional gel techniques (regular
gel format) can resolve between 1000 and 2000 proteins
from a given mammalian cell type, a number that corresponds
to about 2 million base pairs of coded DNA. Less
abundant proteins can be detected by analyzing partially
purified cellular fractions.

Two-dimensional gel electrophoresis has been widely applied
to analysis of cellular protein patterns from bacteria to mammalian
cells (refs 1-6 and references therein). In spite of
much work, however, information gathered from these
studies has not reached the scientific community in its fullness
because of lack of standardized gel systems and the lack
of means for storing and communicating protein information.
Only recently, because of the development of appropriate
computer software (7-13), has it been possible to scan
gels, assign numbers to individual proteins, and store the
wealth of information in quantitative and qualitative comprehensive
2-dimensional gel protein databases (4, 14-23),
i.e., those containing information about the various properties
(physical, chemical, biological, biochemical, physiological,
genetic, immunological, architectural, etc.) of all the
proteins that can be detected in a given cell type. Such integrated
2-dimensional gel protein databases offer an easy
and standardized medium in which to store and communicate
protein information and provide a unique framework in
which to focus a multidisciplinary approach to study the cell.
Once a protein is identified in the database, all of the infor- 
mation accumulated can be easily retrieved and made availa- 
ble to the researcher. In the long run, protein databases are 
expected to foster a wide variety of biological information 
that may be instrumental to researchers working in many 
areas of biology— among others, cancer and oncogene 
studies, differentiation, development, drug development and 
testing, genetic variation, and diagnosis of genetic and clini- 
cal diseases (Fig. 1). 

The approach using systematic 2-dimensional gel protein 
analysis has recently gained a new dimension with the ad- 
vent of techniques to microsequence major proteins recorded 
Figure 1. Interface between partial protein sequence databases,
comprehensive 2-dimensional gel databases, and the human genome
sequencing project. Appropriate software is required to compare
protein and DNA sequences. In general, although the inference
of a protein's sequence from the DNA sequence (thick arrow)
is direct and unambiguous, the DNA sequence can only be inferred
approximately from the protein sequence (thin arrow) and cloning
of the gene requires either a cDNA or the requisite group of
oligonucleotide probes deduced from the partial amino acid sequence.
Modified from ref 6.



in the databases (refs 24-42 and references therein). Partial 
protein sequences can be used to search for protein identity 
as well as to prepare specific DNA probes for cloning as-yet- 
uncharacterized proteins (Fig. 1). As these sequences can be 
stored in the database (see for example Fig. 2H), they offer
a unique opportunity to link information on proteins with
the existing or forthcoming DNA sequence data on the human
genome (Fig. 1) (20, 36, 39).

Using the integrated approach offered by comprehensive 
2-dimensional gel databases (Fig. 1), it will be possible to 
identify phenotype-specific proteins; microsequence them
and store the information in the database; search for homology
with previously characterized proteins; clone the
cDNAs; assign partial protein sequences to genes for which
the full DNA sequence and the chromosome location are 
known, and study the regulatory properties and function of 
groups of proteins (pathways, organelles, etc.) that are coor- 
dinately expressed in a given biological process. Comprehen- 
sive 2-dimensional gel protein databases will depict an in- 
tegrated picture of the expression levels and properties of the 
thousands of protein components of organelles, pathways, 
and cytoskeletal systems in both physiological and abnormal 
conditions and are expected to lead to identification of new 
regulatory networks in different cell types and organisms. In
the future, 2-dimensional gel protein databases may be
linked to each other as well as to national and international
specialized databanks on nucleic acid and protein sequences,
protein structures, NMR experimental data, complex carbohydrates,
etc.

A few 2-dimensional gel protein databases that are accessible 
in a computer form have been published in extenso: these 
correspond to the protein-gene database of Escherichia coli 
K-12 developed by Neidhardt and colleagues (14, 23), the rat
REF52 database established by Garrels and co-workers at
Cold Spring Harbor (18, 22), and a few human databases
(transformed amnion cells [15, 20], normal embryonal lung
MRC-5 fibroblasts [17, 21], keratinocytes [19] and peripheral
blood mononuclear cells [15]) developed in Aarhus. Given
space limitations and to keep this review in focus, we will 
concentrate on the computerized analysis of human cellular 
2-dimensional gel patterns, and in particular on the steps in- 
volved in establishing comprehensive 2-dimensional gel 
databases that can link protein and DNA information. 



MAKING AND MANAGING A COMPREHENSIVE
2-DIMENSIONAL GEL DATABASE OF HUMAN
CELLULAR PROTEINS

The first step in making a comprehensive 2-dimensional
protein database is to prepare a synthetic image (digital form
of the gel image) of the gel (fluorogram, Coomassie blue or silver
stained gel) to be used as a standard or master reference.
This can be done with laser scanners, charge couple device
(CCD) array scanners, television cameras, rotating drum
scanners, and multiwire chambers (13). Computerized analysis
systems for spot detection, quantitation, pattern matching,
and data handling (access and retrieval of information,
database making) have been described in the literature
(ELSIE [43], GELLAB [11], HERMeS [44], MELANIE
[10], QUEST [9], and TYCHO [8]) and some are available
commercially (PDQUEST, Protein Databases Inc., Huntington,
N.Y.; KEPLER, Large Scale Biology, Rockville, Md.;
Visage, BioImage Corporation, Ann Arbor, Mich.; Gemini,
Joyce Loebl, Gateshead; Microscan 1000, Technology
Resources Inc., Nashville, Tenn.; and MasterScan, Billerica,
Mass.). Unfortunately, most of these systems are incompatible
with one another; their advantages and disadvantages
have been discussed by Miller (13).

In our work station in Aarhus, fluorograms are scanned
with a Molecular Dynamics laser scanner and the data are
analyzed using the PDQUEST II software (Protein Databases
Inc.) (12) running on a SPARCstation computer 4100
FC-8-P3 from SUN Microsystems, Inc. The scanner measures
intensity in the range of 0-2.0 absorbance. A typical
scan of a 17 x 17 cm fluorogram takes about 2 min. Steps
in image analysis include: initial smoothing, background
subtraction, final smoothing, spot detection, and fitting of
ideal Gaussian distributions to spot centers. Spot intensity is
calculated as the integration of a fitted Gaussian. If calibration
strips containing individual segments of a known
amount of radioactivity are used, it is possible to merge multiple
exposures of the sample image into a single data image
of greater dynamic range. Once the synthetic image is
created it can be stored on disk and displayed directly on the
monitor. Functions that can be used to edit the images include:
cancel (for example, to erase scratches that may have
been interpreted as spots by the computer, or to cancel streaks or
low dpm spots), combine (sometimes a spot may be resolved
into several closely packed spots), restore, uncombine, and
add spot to the gel. The process is time consuming: about
1-1/2 days per image. Edited standard images can be matched
to other synthetic images. Figure 2A shows a portion of a
standard synthetic image (IEF) of a fluorogram of
[35S]methionine-labeled cellular proteins from human AMA
cells (master database) (20). Images can be displayed either
in black and white (resembling the original fluorograms) or
in color (other images in Fig. 2), depending on the need. As
shown in Fig. 2B, each polypeptide is assigned a number by
the computer, which facilitates the entry and retrieval of
qualitative and quantitative information for any given spot
in the gel (20). The standard image can be matched automatically
by the computer to other standard or reference gels
(Fig. 2C, matching of AMA cellular proteins [left] to MRC-5
proteins [right]) provided a few landmark spots are given
manually as reference (indicated with a + in Fig. 2C) to initiate
the process.
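
The quantitation step mentioned above, in which spot intensity is taken as the integral of a fitted Gaussian, can be sketched as follows. This is a minimal illustration and not the PDQUEST implementation: the axis-aligned elliptical Gaussian model, the function names, and the example parameters are assumptions made here for illustration.

```python
import math


def gaussian_spot_volume(amplitude, sigma_x, sigma_y):
    """Closed-form integral over the plane of
    amplitude * exp(-(x**2 / (2*sigma_x**2) + y**2 / (2*sigma_y**2)))."""
    return 2.0 * math.pi * amplitude * sigma_x * sigma_y


def _line_integral(sigma, step, extent):
    """Midpoint-rule integral of exp(-t**2 / (2*sigma**2)) over +/- extent*sigma."""
    n = int(round(2 * extent * sigma / step))
    start = -extent * sigma
    return step * sum(
        math.exp(-((start + (i + 0.5) * step) ** 2) / (2 * sigma ** 2))
        for i in range(n)
    )


def numeric_spot_volume(amplitude, sigma_x, sigma_y, step=0.05, extent=6.0):
    """Numerical check of the closed form; the model is separable in x and y."""
    return (amplitude
            * _line_integral(sigma_x, step, extent)
            * _line_integral(sigma_y, step, extent))


# Hypothetical fitted parameters: peak absorbance and widths in pixels.
volume = gaussian_spot_volume(1.2, 1.5, 2.0)
check = numeric_spot_volume(1.2, 1.5, 2.0)
```

Using the closed form means a spot's intensity depends only on the three fitted parameters, which is one reason a fitted model is convenient for storing and comparing spots across gels.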



Abbreviations: CCD, charge couple device; PCNA, proliferating
cell nuclear antigen; HPLC, high performance liquid chromatography.
Figure 2. A) Synthetic image of a fraction of an IEF gel of the master image of AMA cellular proteins. B) As in A, but showing the numbers
assigned to each spot. C) Comparison of AMA (left) and normal human embryonal lung MRC-5 fibroblast (right) IEF [...].
[...] G) [...] cytoskeletal and cytoskeletal-related proteins in quiescent, proliferating, and SV40-transformed MRC-5 fibroblasts.
H) Polypeptides that contain information under the category partial amino acid sequences.
The automatic matching process that has been described
in detail by Garrels et al. (13) takes about 5 min. Matched
proteins are indicated with the same letters in both gels (Fig.
2C). The usefulness of this function is emphasized by the fact
that data accumulated on common household proteins can
be easily transferred to any other human cell type
whose 2-dimensional gel cellular protein pattern is matched
to our standard AMA 2-dimensional gel protein image. Alternatively,
if the standard gel is part of a matchset (set of
gels in a given experiment) it can be used as a linker gel to
compare, for example, the quantitative values of a given protein
throughout the experiment (see Fig. 2D; levels of some
proteins in normal and SV40 transformed human MRC-5
fibroblasts) or with other standard images in different sets of
cross-matched experiments (18, 22).
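
The idea of landmark-initiated matching can be sketched as follows. This is a deliberately simplified illustration, not the published matching algorithm: it assumes the two gels differ only by a uniform shift estimated from the manually chosen landmarks, and it pairs spots greedily by nearest neighbour. The function names, the tolerance, and all coordinates are hypothetical.

```python
import math


def mean_offset(landmarks):
    """Average displacement from gel A to gel B.

    landmarks: list of ((x_a, y_a), (x_b, y_b)) pairs chosen by hand.
    """
    n = len(landmarks)
    dx = sum(b[0] - a[0] for a, b in landmarks) / n
    dy = sum(b[1] - a[1] for a, b in landmarks) / n
    return dx, dy


def match_spots(spots_a, spots_b, landmarks, tolerance=2.0):
    """Pair each spot of gel A with the nearest unused spot of gel B.

    spots_a, spots_b: dicts mapping spot id to (x, y).
    A pair is accepted only if the shifted distance is within tolerance.
    """
    dx, dy = mean_offset(landmarks)
    matches, used = {}, set()
    for id_a, (xa, ya) in spots_a.items():
        best_id, best_d = None, tolerance
        for id_b, (xb, yb) in spots_b.items():
            if id_b in used:
                continue
            d = math.hypot(xa + dx - xb, ya + dy - yb)
            if d <= best_d:
                best_id, best_d = id_b, d
        if best_id is not None:
            matches[id_a] = best_id
            used.add(best_id)
    return matches


# Hypothetical spot coordinates for two gels.
gel_a = {"a1": (10.0, 10.0), "a2": (20.0, 15.0), "a3": (30.0, 40.0), "a4": (50.0, 50.0)}
gel_b = {"b1": (12.1, 9.0), "b2": (22.0, 14.2), "b3": (31.8, 38.9), "b4": (80.0, 80.0)}
marks = [(gel_a["a1"], gel_b["b1"]), (gel_a["a3"], gel_b["b3"])]
pairs = match_spots(gel_a, gel_b, marks)
```

Real gels distort non-uniformly, so a practical system would need a local or piecewise transformation rather than a single shift. The sketch only shows why a few landmarks are enough to start the process.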

Once a standard map of a given protein sample is made
one can enter qualitative annotations to make a reference
database. Our master 2-dimensional gel database of transformed
human amnion cell (AMA) proteins (20) lists 3430
polypeptides, of which 2592 correspond to cellular components,
having pI's ranging from 4 to 13 and molecular
weights between 8.5 and 230 kDa. The most abundant proteins
in the database correspond to total actin (3.87% of total
protein; about 90 million molecules per cell) while the
less abundant of the recorded polypeptides are present in
the vicinity of 5000 molecules per cell. Some annotation
categories we are using to establish the master AMA database
include: 1) protein identification (comigration with
purified proteins, 2-dimensional immunoblotting, microsequencing);
2) amounts (total amounts and levels of synthesis);
3) subcellular localization (nuclear, cytoskeletal, membrane,
membrane receptors, specific organelles, etc.); 4)
antibodies; 5) posttranslational modifications (phosphorylation,
glycosylation, methylation, etc.); 6) microsequencing; 7)
cell cycle specificity (specific variations in levels of synthesis
and amount); 8) regulatory behavior (effect of hormones,
growth factors, heat shock, etc.); 9) rate of synthesis in normal
and transformed cells (proliferation sensitive proteins,
cell cycle specific proteins, oncogenes, components of the
pathway (or pathways) that control cell proliferation); 10)
function (mainly from comigration with proteins of known
function); 11) sets of proteins that are coordinately regulated
(hierarchy of controls, differential gene expression in various
cells, etc.); 12) cDNAs (cloned cDNAs); 13) proteins that are
specific to a given disease (systematic comparison of protein
patterns of fibroblast proteins from healthy and diseased
individuals); 14) expression and exploitation of transfected
cDNAs; 15) pathways (metabolic, others); 16) gene localization
(genetic and physical); 17) effect of microinjected antibody
on patterns of protein synthesis; and 18) secreted proteins.
Information entered for any spot in a given annotation
category can be easily retrieved by asking the computer to
display the information on the color screen. For example,
Fig. 2E shows a synthetic image of a NEPHGE gel (master
AMA database) displaying the information contained under
the entry glycolytic pathway. Alternatively, one can use the
function peruse annotations for spot to directly ask the computer
to list all the entries available for a particular protein.
By clicking the mouse in a given entry (in this case, presence
in fetal human tissues) it is possible to take a quick look at
the information in that particular entry (Fig. 2F).
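
The two retrieval modes just described (listing every entry for one spot, and displaying all spots that carry an entry in one category) can be sketched with a small in-memory structure. This is an illustrative assumption about how such a database might be organised, not a description of the PDQUEST data model; the class and method names are hypothetical, and the NEPHGE spot number is invented for the example. The lipocortin V values are those given in Table 1.

```python
from dataclasses import dataclass, field


@dataclass
class SpotRecord:
    """One polypeptide spot in a reference gel image."""
    gel: str  # gel type, e.g. "IEF" or "NEPHGE"
    ssp: int  # spot number assigned by the computer
    annotations: dict = field(default_factory=dict)  # category -> entry


class GelDatabase:
    def __init__(self):
        self._spots = {}  # (gel, ssp) -> SpotRecord

    def annotate(self, gel, ssp, category, entry):
        """Store one entry under an annotation category for a spot."""
        record = self._spots.setdefault((gel, ssp), SpotRecord(gel, ssp))
        record.annotations[category] = entry

    def peruse(self, gel, ssp):
        """Return every annotation stored for one spot."""
        return dict(self._spots[(gel, ssp)].annotations)

    def spots_with(self, category):
        """Return the (gel, ssp) keys of all spots annotated in a category."""
        return sorted(key for key, rec in self._spots.items()
                      if category in rec.annotations)


db = GelDatabase()
db.annotate("IEF", 8216, "protein name", "lipocortin V")
db.annotate("IEF", 8216, "isoelectric point", "4.76")
db.annotate("IEF", 8216, "apparent molecular weight", "33.3 kDa")
# Hypothetical spot number, used only to show a second category.
db.annotate("NEPHGE", 1234, "pathways", "glycolytic pathway")

entries = db.peruse("IEF", 8216)
glycolytic = db.spots_with("pathways")
```

Keying records on the gel type together with the spot number reflects the way spots are cited in the text, for example IEF SSP 8216.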

A major obstacle encountered in building comprehensive
2-dimensional gel protein databases is identifying the large
number of proteins separated by this technology. In our
databases (20, 21), known proteins are identified by one or
a combination of the following procedures: 1) comigration
with known proteins, 2) 2-dimensional gel immunoblotting
using specific antibodies, and 3) microsequencing of
Coomassie Brilliant Blue stained human proteins recovered
from dried 2-dimensional gels (see next section). Protein
identification by means of microsequencing may be difficult
as individual protein members of families with short peptide
differences may escape detection. In the gene-protein database
of E. coli K-12 (14, 23), another major 2-dimensional gel
database available at present, proteins are being identified by
a wider range of tests that include comigration with purified
proteins; genetic criterion (deletion, insertion, frameshift,
nonsense, missense, regulatory); plasmid-bearing strains
and in vitro synthesis of protein; selective labeling (methylation,
phosphorylation); peptide map similarity; and physiological
criterion and selective derealization.



[...] nearly 550 antibodies from laboratories
[...] world [...] these are being systematically
tested by 2-dimensional gel immunoblotting for antigen determination.
Similarly, purified proteins and organelles
provided by several laboratories have greatly aided identification
of unknown proteins (20, 21). We routinely request antibodies
and protein samples and promise the donors to make
available all the information we may have accumulated on that
particular protein. For example, Table 1 lists entries available
for lipocortin V (IEF SSP 8216), also known as annexin
V, VAC-α, endonexin II, renocortin, chromobindin-5, anticoagulant
protein, PAP-I, γ-calcimedin, IBC, calphobindin,
and anchorin CII.

As mentioned previously, one distinct advantage of
2-dimensional gel electrophoresis is the possibility of studying
quantitative variations in cellular protein patterns that
may lead to identification of groups of proteins that are expressed
coordinately during a given biological process.
Quantitation, however, is not an easy task, as reflected by the
lack of published data on global cellular protein patterns. We
believe this is partly due to difficulties in obtaining sets of
gels that are suitable for computer analysis (streaking,
material remaining at the origin, etc.) as well as to limitations
(laborious editing time, need of calibration strips to
merge images, limited dynamic range, etc.) in the computer
analysis systems available at the moment. Perhaps the most
advanced quantitative studies published so far using computer
analysis have been carried out by Garrels and co-workers
(18, 22). In particular, these investigators have established
a quantitative rat protein database (18, 22) designed
to study growth control (proliferation, growth inhibitors, and
stimulation) and transformation in well-defined groups of
cell lines obtained by transformation of rat REF52 cells with
SV40, adenovirus, and the Kirsten murine sarcoma virus.
These studies have revealed clusters of proteins induced or
repressed during growth to confluence as well as groups of
transformation-sensitive proteins that respond in a differential
fashion to transformation by DNA and RNA viruses. A
most interesting feature of this quantitative database is the
discovery of a group of coregulated proteins that show similar
expression patterns as the cell cycle-regulated DNA replication
protein known as proliferating cell nuclear antigen
(PCNA)/cyclin (45).

In our human databases, most quantitations have been
carried out by estimating the radioactivity contained in the
polypeptides by direct counting of the gel pieces in a scintillation
counter (20, 21). Up to 700 proteins can be cut out
through appropriately exposed films in a period of time comparable
to that required for editing a synthetic image.
Manual quantitation of this large number of spots is difficult
without the assistance of a master reference image and a
numbering system that can be used to identify the spots. Using
this approach, we have recorded quantitative changes in
the relative abundance of 592 [35S]methionine-labeled proteins
synthesized by quiescent, proliferating, and SV40
transformed human embryonic lung MRC-5 fibroblasts (21).
Some data concerning cytoskeletal and cytoskeletal-related
proteins are presented in Fig. 2G. Our studies as well as
those of Garrels and co-workers (18, 22) may in the long run
help define patterns of gene expression that are characteristic
of the transformed state.
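
One way to turn such scintillation counts into relative levels of the kind listed in Table 1 (Q, P, and T values) is sketched below. The normalization shown is an assumption made for illustration, since the text does not state how the ratios were computed: each spot's counts are divided by the total counts recovered for that sample, and the result is expressed relative to the proliferating sample. The function name and all counts are hypothetical.

```python
def relative_levels(spot_counts, total_counts, reference):
    """Express a spot's share of total counts relative to a reference sample.

    spot_counts:  sample name -> counts (dpm) in the excised spot
    total_counts: sample name -> counts (dpm) summed over all excised spots
    reference:    sample name whose level is defined as 1.0
    """
    fractions = {s: spot_counts[s] / total_counts[s] for s in spot_counts}
    ref = fractions[reference]
    return {s: round(f / ref, 2) for s, f in fractions.items()}


# Hypothetical counts for one spot in quiescent (Q), proliferating (P),
# and SV40 transformed (T) cells.
spot = {"Q": 1100.0, "P": 1000.0, "T": 300.0}
total = {"Q": 1.0e6, "P": 1.0e6, "T": 1.0e6}
levels = relative_levels(spot, total, reference="P")
```

Dividing by the per-sample total compensates for differences in labeling efficiency or loading between samples, which a comparison of raw counts would not.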



OTHER 2-DIMENSIONAL GEL PROTEIN 
DATABASES 



As mentioned previously there are other 2-dimensional gel 
databases available in computer form that have been pub- 
TABLE 1. Some entries for lipocortin V in the human AMA 2-dimensional gel protein database

Entries for lipocortin V (IEF SSP 8216): information entered

1. Protein name: Lipocortin V, renocortin, chromobindin-5, endonexin II, anticoagulant protein,
   PAP-I, VAC-α, 35-γ-calcimedin, IBC, calphobindin I, anchorin CII, annexin V

2. Percentage of total protein: 0.110% (about 2,800,000 molecules per cell)

3. Apparent molecular weight (Mr): 33.3 kDa

4. Isoelectric point (pI): 4.76

5. Method (or methods) of identification: Microsequencing, 2-dimensional immunoblotting, comigration

6. Credit to investigators that aided in identification: [...] J. Vandekerckhove and colleagues,
   Rijksuniversiteit Gent; B. Pepinsky, BIOGEN, Cambridge; N. G. Ahn, University of Washington

7. Antibody against protein: Polyclonal (rabbit, antibody no. 20), B. Pepinsky, BIOGEN, Cambridge

8. Comigration with human proteins: Lipocortin V, N. G. Ahn, Howard Hughes Medical Institute,
   Washington [...]

9. Cellular localization: Subcortical membrane

10. Calcium/phospholipid-dependent membrane proteins: Lipocortin V

11. Function: Regulation of various aspects of inflammation, immune response, blood
    coagulation, and differentiation

12. Partial amino acid sequence: GTVTDFPGFDER (7-18), VLTEIIASR (109-117),
    QVYEEEYGSSLEDD[...] (127-143), ?GTDEEKFITIFGT(R) (187-201)

13. cDNA sequence: Known, R. Blake et al., J. Biol. Chem. 263, 10799-10811; 1988
    (pI = 4.76 from translated sequence)

14. Levels in fetal human tissues: adrenal glands = [illegible]; brain = [illegible];
    cerebellum = [illegible]; ear = [illegible]; eye = [illegible]; heart = [illegible];
    hypophysis = [illegible]; liver = [illegible]; lung = [illegible]; meninges = [illegible];
    mesonephric tissue = [illegible]; striated muscle = [illegible]; pancreas = [illegible];
    skin = + + +; spleen = + + +; stomach = [illegible]; submandibular gland = [illegible];
    small intestine = [illegible]; thymus = [illegible]; thyroid gland = + + +;
    tongue = [illegible]; ureter = [illegible]

15. Levels in quiescent, proliferating, and transformed MRC-5 fibroblasts:
    Q (quiescent) = 1.1; P (proliferating) = 1.0; T (SV40 transformed) = 0.3

16. Distribution in Triton supernatant and cytoskeletons: Mainly supernatant

lished in extenso: these correspond to the E. coli K-12
protein-gene database (14, 23) and to the rat REF52 database
(18, 22).

The E. coli K-12 cellular protein-gene database is perhaps 
the most complete of all databases reported so far and even- 
tually it should trace each protein back to its structural gene. 
Information contained in this database includes: gene/protein
name (protein name, EC number, gene name);
2-dimensional gel spot designations (x-y coordinates from
reference gels, alphanumeric designation); genetic information
(linkage map location, physical map location, GenBank
code, sequence reference, location on Kohara clones); biochemical
information (molecular weight, pI, number of
residues of each amino acid, mole percent of each amino
acid, total number of amino acids in a polypeptide); and
regulatory information (cellular level of protein in different
media and different temperature, member of regulon, member
of stimulon). Major advances of this database are envisaged
in the future in view of the imminent sequencing of
the whole E. coli genome as well as the development of improved
methods to express cloned genes.

The rat REF52 2-dimensional gel protein database lists 
about 1600 proteins that have been recorded using the 
QUEST analysis system (18, 22). Included in this quantitative
database are 1) protein names (cytoskeletal and heat
shock proteins as well as various nuclear, mitochondrial, and
cytoplasmic proteins), 2) annotations (subcellular localization,
modification, recognition by specific antibodies,
coprecipitation, NH2-terminal sequence, cross-reference to
protein sequence information and references to the literature),
3) protein sets (cytoskeletal proteins, phosphoproteins,
sets of proteins with PCNA/cyclin-like properties, etc.) and
4) general quantitative data (protein synthesis during growth
of normal REF52 cells to confluence and quiescence, and after
restimulation of growth-inhibited cells).

In addition to the 2-dimensional gel databases mentioned 
so far there are several smaller cellular databases being established
in human (normal human diploid fibroblasts, lymphocytes,
leukocytes, leukemic cells), mouse (NIH/3T3 cells,
T lymphocytes), Aplysia, yeast (Saccharomyces cerevisiae), plants
(wheat, barley, sorghum), and Euglena. Databases of tissue
proteins (brain, whole mouse, liver) and body fluid proteins
(plasma proteins, cerebrospinal fluid, urine, and milk) are
being established in several laboratories. The reader is 
directed to the review by Celis et al. (4) for details and refer- 
ences concerning these databases. 



MICROSEQUENCING HAS ADDED A NEW
DIMENSION TO COMPREHENSIVE 
2-DIMENSIONAL GEL DATABASES: A DIRECT 
LINK BETWEEN PROTEINS AND GENES 

The development of highly sensitive amino acid gas-phase or 
liquid-phase sequenators (24), together with the establish- 
ment of efficient protein and peptide sample preparation 
methods, has opened the possibility to perform a systematic 
sequence analysis of proteins resolved by 2-dimensional gel 
electrophoresis. Indeed, generated pieces of protein se- 
quences can be used to search for protein identity (compari- 
son with available sequences stored in databanks) as well as 
for preparing specific DNA probes for cloning of as yet un- 
characterized proteins (Fig. 1). In addition, partial protein 
sequences can be stored in 2-dimensional gel databases (for 
example, see Fig. 2H) and offer a unique link between pro- 
teins and genes (Fig. 1). 
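
The use of a partial sequence to search for protein identity can be illustrated with a toy exact-match search. This sketch is an assumption-laden simplification: real databank searches rely on homology tools rather than exact matching, the function name is hypothetical, and the protein string below is an invented toy sequence that merely embeds one peptide from Table 1 (VLTEIIASR) and a shortened form of another. The `?` wildcard mirrors the way an unidentified residue is written in Table 1.

```python
def locate_peptide(protein, peptide, wildcard="?"):
    """Return 1-based (start, end) positions where peptide occurs in protein.

    A wildcard character in the peptide matches any single residue.
    """
    hits = []
    n, m = len(protein), len(peptide)
    for start in range(n - m + 1):
        window = protein[start:start + m]
        if all(p == wildcard or p == w for p, w in zip(peptide, window)):
            hits.append((start + 1, start + m))
    return hits


# Invented toy sequence; it is NOT the lipocortin V sequence.
toy_protein = "MAQ" + "VLTEIIASR" + "TPEELRGTDEEKF"

exact_hits = locate_peptide(toy_protein, "VLTEIIASR")
wildcard_hits = locate_peptide(toy_protein, "?GTDEEKF")
```

Reporting residue positions alongside the peptide follows the convention used for the partial sequences in Table 1, where each peptide is followed by its position range.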

In the early 1970s gel electrophoresis was used to purify
proteins for sequencing purposes (reviewed by Weber and 
Osborn in ref 25). Proteins were recovered by diffusion and 
sequenced by the manual dansyl-Edman degradation at the 
nanomole level. This technique was further refined by using 
electro-elution to recover proteins and by miniaturizing the 
system (26). This method has been used extensively, but 
showed increasing drawbacks (low yields, protein samples 
contaminated by free amino acids, and NH 2 -terminal block- 
ing) as the amounts of handled protein gradually became 
smaller (e.g., at the 30 picomol level). 

Most of the problems referred to above have been 
minimized with the introduction of protein-electroblotting 
procedures (27-32). When proteins are blotted on chemi- 
cally inert membranes, it is possible to sequence the immobi- 
lized proteins directly without additional manipulations. 
Thus, depending on the amount of bound protein and its nature,
this direct sequencing procedure generally yields NH2-terminal
sequences containing 10-40 residues. As such, this
technique was used to identify, by their NH2-terminal sequences,
differentially expressed major proteins from total
cellular extracts separated on 2-dimensional gels. A major 
difficulty encountered in this procedure is the occurrence of 
frequent artefactual blockage of the proteins. Several studies 
suggest that this phenomenon is mainly due to reaction with 
contaminants (particularly unpolymerized acrylamide 
present in the gel) and to a high dilution of the protein (low 
concentration of the protein per unit membrane surface). In 
addition to this primarily technical problem, many proteins 
are blocked in vivo by acylation or by a pyrrolidone carboxylic
acid cap.

The problem of partial or complete NH2-terminal blockage
can be circumvented by generating internal amino acid
sequences. This is achieved by fragmenting the protein 
present in the gel (gel in situ cleavage) or by cleaving it while 
bound to the membrane (membrane in situ cleavage) 
(33-35). In both cases, proteins are either cleaved in a res- 
tricted way (e.g., by limited enzymatic digestion or by using 
restriction chemical cleavage conditions) or fragmented into 
smaller peptides. 




Of the different combinations examined, we had best
results by using exhaustive proteolytic digestion of
membrane-immobilized proteins. This method has been
described for Ponceau red-stained proteins on nitrocellulose
blots (34), for Amido black-stained Immobilon-bound proteins,
and for fluorescamine-detected proteins on glass fiber
membranes (35). The proteases used (trypsin, chymotrypsin,
or pepsin) cleave at multiple sites, generating small peptides
that elute from the blot into the digestion buffer, from which
they are purified by reversed-phase high performance liquid
chromatography (HPLC) before being sequenced individually.
Although each of these manipulations could be expected
to result in a reduced yield of final sequence information, we
were surprised that the peptides could be sequenced with
high efficiency. In our hands, this approach could be routinely
applied to gel-purified proteins available in amounts
ranging from 5 to 10 µg, and often yielded sequence information
covering more than 30% of the total protein. As
membrane-immobilized proteins are not homogeneously
digested, but rather show protease sensitivity next to resistant
regions, the number of peptides generated is much lower
than expected from the number of potential cleavage sites.
Consequently, HPLC peptide chromatograms are less complex
and most peptides can be recovered in pure form.
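
To make the phrase "number of potential cleavage sites" concrete, the following sketch performs an idealized in silico tryptic digest using the commonly quoted rule that trypsin cleaves after lysine or arginine unless the next residue is proline. It gives the theoretical upper bound on the peptide count; as noted above, digestion of membrane-bound protein yields fewer peptides in practice. The example sequence and function name are hypothetical.

```python
def tryptic_peptides(sequence):
    """Split a protein after K or R, except when the next residue is P.

    This is the idealized complete digest; real digests of
    membrane-immobilized proteins are expected to be less complete.
    """
    peptides = []
    start = 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        # Trailing peptide that does not end in K or R.
        peptides.append(sequence[start:])
    return peptides


if __name__ == "__main__":
    example = "AKRPGKD"  # made-up sequence for illustration only
    print(tryptic_peptides(example))
```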

As only limited amounts of a protein mixture can be 
loaded on a 2-dimensional gel, proteins of interest are often
obtained in yields insufficient for the currently available se- 
quencing technology. More material can be obtained by en- 
riching for a certain subcellular fraction (purified cell or- 
ganelles) or by exploiting affinity (dyes, metals, drugs, etc) or 
hydrophobic properties of proteins before gel analysis. All of 
the sequencing results accumulated so far in the human pro- 
tein database (20) (a few are shown in Fig. 2H) have been 
obtained from analysis of protein spots collected from 
2-dimensional gels that had been stained with Coomassie 
blue according to standard procedures and dried for storage. 
Proteins are recovered from the collected gel pieces by a 
protein-elution-concentration device, combined with gel 
electrophoresis and electroblotting. Details of this technique 
have been reported in a previous communication (42) and a 
brief outline is given below. 

Combined gel pieces are allowed to swell in gel sample 
buffer (a total volume of 1.5 ml). The gel pieces combined 
with the supernatant are then collected into a large slot made 
in a new gel. The slot is further filled with Sephadex G-10 
equilibrated in gel sample buffer. During consecutive gel 
electrophoresis, most of the electrical current passes on the 
side of the slot instead of passing through the slot. This 
results in both a vertical stacking and horizontal contraction 
of the protein band. With this device the protein is efficiently 
eluted from the gel pieces and concentrated from a large 
volume into a narrow spot. The highly concentrated (about
5 mm²) protein spot is then electroblotted on PVDF
membranes, stained with Amido black, and in situ digested
with trypsin. The peptides generated during digestion elute 
from the membrane into the supernatant, and can be sepa- 
rated by narrow bore reversed-phase HPLC and collected in- 
dividually for sequence analysis. 

Using this and previous procedures (37, 39, 42), we have 
so far analyzed 70 protein spots collected from 
2-dimensional gels (20, and unpublished observations) (see 
for example Fig. 2H). The sequence information amounts to 
2100 allocated residues corresponding to an average of 30 
residues per protein spot. So far we have made cDNAs of 
many of the unknown proteins that have been microsequenced,
and a substantial number have been cloned and sequenced.
All available information indicates that it may be
possible to obtain partial sequence information from most of
the proteins that can be visualized by Coomassie Brilliant
Blue staining.

Partial protein sequences are stored in the database as displayed
in Fig. 2H, and it should be possible in the near future
to interface this information with forthcoming DNA sequence
data from the human genome project. In the long
run, as the human genome sequences become available, it
will be possible to assign partial protein sequences to genes
for which the full DNA sequence and chromosomal location
are known (Fig. 1).
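
The assignment step described here can be pictured as a simple lookup of a partial protein sequence against predicted gene products. The sketch below is a minimal illustration using exact substring matching. The record layout, gene names, chromosome labels and sequences are all invented for the example, and a real search would need to tolerate sequencing ambiguities.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneRecord:
    gene: str               # hypothetical gene name
    chromosome: str         # hypothetical chromosomal location
    protein_sequence: str   # predicted translation of the gene


def assign_peptide(peptide, records):
    """Return the genes whose predicted protein contains the peptide."""
    query = peptide.strip().upper()
    if not query:
        raise ValueError("peptide must not be empty")
    return [r.gene for r in records if query in r.protein_sequence.upper()]


# Toy records; none of these sequences come from the article.
RECORDS = [
    GeneRecord("geneA", "chr1", "MSTAVLENPGLGRKLSDFGQETSYIEDNCNQNGAISLIFSLK"),
    GeneRecord("geneB", "chr7", "MADEEKLPPGWEKRMSRSSGRVYYFNHITNASQWERPSG"),
]

if __name__ == "__main__":
    print(assign_peptide("PPGWEK", RECORDS))
```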

SUMMARY 

The studies presented in this brief review are intended to 
demonstrate the usefulness of computer-aided 2-dimensional 
gel electrophoresis and microsequencing to analyze cellular 
protein patterns, and to link protein and DNA information. 
As more information is gathered worldwide, comprehensive
databases will depict an integrated picture of the expression
levels and properties of the thousands of proteins that orchestrate
most cellular functions.

Clearly, databases allow easy access to a large body of data
and provide an efficient medium to communicate standardized
protein information. In the future, databases will
foster a wide variety of biological information that can be
used to support collaborative research projects in basic and
applied biology as well as in clinical research (2, 5, 46). Once
a protein is identified in a particular database all the information
gathered on it can be made available to the scientist.
However, many problems must be solved before protein
databases become of general use to the scientific community.
A most urgent one is to promote standardization of the gel
running conditions so that data produced in a given laboratory
may be used worldwide. Surprisingly, the gel running
technology as it stands today is still a craftsmanship art.

Finally, comprehensive, computerized databases of pro- 
teins, together with recently developed techniques to 
microsequence proteins, offer a new dimension to the study
of genome organization and function (Fig. 1). In particular, 
human protein databases may become increasingly impor- 
tant in view of the concerted effort to map and sequence the 
entire human genome. This formidable task is expected to 
dominate biological research in the next decades. 

We would like to thank S. Himmelstrup Jorgensen for typing the
manuscript and O. Sonderskov for photography. Work in the

Nonenzymatic extraction of cells from clinical tumor
material for analysis of gene expression by two-
dimensional polyacrylamide gel electrophoresis

We have compared different methods of preparation of malignant cells for
two-dimensional electrophoresis (2-DE). We found all methods using fresh
tissue to be superior compared to methods using frozen tissue. Our results
indicate that nonenzymatic methods of preparation of tumor cells, including
fine needle aspiration, scraping and squeezing, have advantages over methods
using enzymatic extraction of cells. Nonenzymatic methods are rapid, appear
to [illegible] high molecular weight protein species, and alleviate the necessity of
separating viable and nonviable cells by Percoll gradient centrifugation. Using
these techniques, high-quality 2-DE maps were derived from tumors of the
lung [illegible]. A number of polypeptides, including [illegible], heat shock proteins,
non-muscle tropomyosins and intermediate filaments, were identified. We conclude
that nonenzymatic extraction of malignant cells from fresh tumor tissue
improves the possibilities that these techniques may be useful in clinical diagnosis.


1 Introduction 

Tumors may develop by a number of different mechanisms
in any given cell type. At the time of diagnosis
tumors will have progressed along different pathways to
various stages of malignancy. To provide a basis for individual
therapy it is of importance to examine specific
properties of the tumor cell population in each patient.
A large number of different markers have been described
in order to increase the diagnostic accuracy. It is
likely that a combination of several markers is needed
in the future in order to reflect different properties of
the tumor. One important method for the resolution of a
large number of potential markers is two-dimensional
electrophoresis (2-DE). Extensive efforts are being made
in identifying various polypeptides separated by 2-DE
and to characterize how the expression of these polypeptides
is affected by the response to cellular transformation
and various culture conditions [1, 2]. It would be of
value to transfer this information to 2-DE separations of
polypeptides from tumor tissue samples. However, one
prerequisite is that the quality of the 2-DE gels from
tumor samples is comparable with that of 2-DE gels
from samples of cultured cells.

Frozen tumor tissues are commonly used for various biochemical
assessments. However, if such samples are analyzed
by 2-D polyacrylamide gel electrophoresis (PAGE)
the polypeptide patterns are obscured by contamination
with serum and connective tissue proteins. Such non-tumor-cell-related
variations represent serious problems in
the interpretation and inter-patient comparison of 2-DE
patterns [3]. 2-DE patterns of cells prepared from fresh
tumor material were analyzed after enzymatic extraction
of tumor cells [4, 5] or after culturing tumor fragments in
medium containing radioactive amino acids [6]. These
procedures may, however, lead to alterations in the gene
expression/polypeptide patterns. We are only aware of
one study where nonenzymatic extraction of cells from
fresh tumor tissue (prostate cancer) was used to prepare
samples for 2-D PAGE [4]. We have examined enzymatic
extraction and various nonenzymatic preparation techniques,
including fine needle aspiration, for the preparation
of cells from fresh tumor tissues. We describe
nonenzymatic extraction procedures that are rapid, lead
to high-quality 2-DE patterns, and that alleviate the
necessity to purify tumor cell populations from dead
cells.

Correspondence: Dr. Bo Franzén, Division of Tumor Pathology,
Department of Pathology, [illegible], Karolinska Hospital and Institute,
Stockholm 60, Sweden

Abbreviations: PCNA, proliferating cell nuclear antigen; PIH, protease
inhibitors; PMSF, phenylmethylsulfonyl fluoride; SDS, sodium dodecyl
sulfate; WW, wet weight

2 Materials and methods 

2.1 Cell cultures and samples used for spot 
identification 

A rat embryonal fibroblast cell line, WT2 (a kind gift
from Dr. J. I. Garrels and Dr. S. Pattersson), was used for
the identification of a number of heat shock and structural
proteins. Human normal diploid lung fibroblasts,
WI38, and human epithelial breast carcinoma cells, MDA-231
and MCF-7, were purchased from ATCC and grown
as recommended. Polypeptides prepared from a leukemia
type pre-B-ALL were separated by 2-DE. The
2-DE map was then analyzed by Dr. S. M. Hanash (University
of Michigan, Ann Arbor, USA).

2.2 Tumor tissue samples

In this study, 2-DE maps from seven tumors were used
as representative illustrations: two adenocarcinomas of
the lung (LA and LB, mucinous, both cases intermediate
grade of differentiation), one squamous carcinoma
of the lung (LS), one carcinoid-like breast cancer (BC),
one microfollicular adenoma (highly differentiated) of
the thyroid (TA), one highly differentiated hypernephroma,
a tumor of the kidney (KH), and finally one case
of poorly differentiated corpus carcinoma (CP).

2.3 Preparation of cultured cells

The cell monolayers were washed twice in phosphate
buffered saline (PBS) and then scraped off in ice-cold
PBS including protease inhibitors (PIH: phenylmethylsulfonyl
fluoride (PMSF) 0.2 mM and 0.83 mM benzamidine),
pelleted at 660 × g, 3 min (+4°C) and washed one
time before final centrifugation at 2700 × g, 5 min. The
wet weight of the cell pellet was recorded and the cells
were stored at -80°C until further processing.

2.4 Preparation of tumor tissue samples 

2.4.1 General remarks 

Macroscopically representative and non-necrotic tumor
tissues were selected within 20 min after resection.
Parallel samples were routinely prepared for cytology.
The samples were processed as rapidly as possible on ice
or at +4°C and in the presence of PIH. Cells were
stained with Diff-Quick (Baxter) and usually examined on
three different occasions during the preparation procedure:
(i) cytology sample, (ii) extracted cells and (iii)
cells after Percoll gradient centrifugation.

2.4.2 Specimen acquisition 

The strategy of sample preparation is shown in Fig. 1.
Tumor tissue cell samples were usually obtained by fine
needle aspiration (NA) using a 0.7 mm needle. The
syringe was filled with 1-2 mL of ice-cold culture medium/PIH.
We found that if a tumor appeared to be very
fibrous it was difficult to extract enough cells for 2-DE
analysis. In these cases, two alternative techniques were
examined. (i) The tumor was cut in the middle and the
fresh surface scraped (SC) with a scalpel. The cell-rich
material was then transferred to ice-cold culture
medium (L15 with 5% fetal calf serum)/PIH. (ii) A part
of the tumor sample was placed in culture medium on
ice for further processing at the laboratory in the following
way: the material was cut into very small fragments
on a pre-cooled dissection plate and transferred
to a small glass chamber with a 0.7 mm metal net 5 mm
above the bottom of the chamber. Medium/PIH was
added to cover the sample (8 mL), which was gently
squeezed (SQ) towards the net in order to release and
wash out cells. NA and SC were also compared with an
enzymatic extraction (EE) procedure described previously
[5]. Briefly, thin slices of tissue were incubated
with collagenase (1 mg/mL) and elastase (2 mg/mL) in
medium for 1 h at 37°C. Extracted cells from every
sample were then subjected to Percoll gradient centrifugation
(Section 3.2.3).

2.4.3 Separation of cells by Percoll gradient
centrifugation

The cell suspension was filtered through two nylon mesh
filters, (i) 250 µm and (ii) 100 µm, and then centrifuged
at 660 × g for [illegible] min. The cell pellet was resuspended
carefully in medium, using a syringe, and loaded onto a
two-step discontinuous Percoll/PBS gradient ([illegible]
(density = 1.03 g/mL) and [illegible] (density = 1.07 g/mL))
and centrifuged at 1000 × g for 15 min. In this system,
dead cells stay on the top, viable cells sediment to the
interphase and erythrocytes sediment to the bottom. The
viability of cells in the top fraction and interphase was
checked by the trypan blue exclusion test. The interphase
cell layer (> 90% viability) was collected and
washed one time in a large volume of PBS/PIH (centrifuged
at 800 × g for 3 min). Finally, the cells were resuspended
in 1.4 mL PBS and pelleted at 2700 × g for [illegible]
min. The wet weight (WW) was recorded and the pellet
was then stored at -80°C.

2.4.4 Final preparation of cells for 2-D PAGE analysis 

From this point, cultured cell samples were treated
in the same way as tumor cell samples. Each cell pellet
was thawed on ice and resuspended in 1.89 µL mQ water
per mg WW (= 1.89 × WW µL). The suspension was
frozen and thawed 4-5 times to break the cells [7]. A
volume of (0.089 × WW) µL 10% sodium dodecyl
sulfate (SDS), including 33.3% mercaptoethanol, was
mixed with the sample and incubated 5 min on ice with
(0.[illegible]29 × WW) µL of a solution of DNase I (0.144
mg/mL in 20 mM Tris-HCl with 2 mM CaCl2 × 2H2O, pH
8.8) and RNase A (0.0718 mg/mL Tris) [8, 9]. The sample
was frozen and lyophilized. Sample buffer [10] including
PMSF (0.2 mM), EDTA ([illegible]), 0.5% Nonidet P-40
(NP-40), and 3-[(3-cholamidopropyl)dimethylammonio]-
1-propanesulfonate (CHAPS; 25 mM) was added carefully,
mixed for 2.5 h and centrifuged for 15 min at

Figure 1. Experimental flow chart showing main steps of the preparation
procedures. The abbreviations used for nonenzymatic extraction
procedures are: FZ, frozen sample preparation; NA, needle aspiration;
SC, scraped; and SQ, squeezed sample. Extracted cells are then
loaded as a suspension (top volume of each tube) onto either
1.07 g/mL Percoll (left), or a discontinuous Percoll gradient from the
nonenzymatic extraction (middle), or from enzymatic extraction
(right). Cellular top- and interphase fractions are then used for 2-DE.
For details see Section 2.
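
Because the reagent volumes in Section 2.4.4 are all given per milligram of pellet wet weight (WW), they can be tabulated with a few lines of code. The sketch below is only a convenience illustration: the function and key names are hypothetical, and the per-milligram factor for the DNase I/RNase A solution must be supplied by the caller because it is not fully legible in this copy of the text.

```python
WATER_UL_PER_MG = 1.89   # µL water per mg wet weight (from the text)
SDS_UL_PER_MG = 0.089    # µL 10% SDS/mercaptoethanol per mg wet weight (from the text)


def lysis_volumes(wet_weight_mg, nuclease_ul_per_mg):
    """Return reagent volumes in µL for a pellet of the given wet weight.

    `nuclease_ul_per_mg` is deliberately a required argument: the value
    printed in the source is partly illegible, so no default is assumed.
    """
    if wet_weight_mg <= 0:
        raise ValueError("wet weight must be positive")
    if nuclease_ul_per_mg <= 0:
        raise ValueError("nuclease factor must be positive")
    return {
        "water_ul": WATER_UL_PER_MG * wet_weight_mg,
        "sds_ul": SDS_UL_PER_MG * wet_weight_mg,
        "nuclease_ul": nuclease_ul_per_mg * wet_weight_mg,
    }


if __name__ == "__main__":
    # 0.3 µL/mg is a placeholder, not the value used by the authors.
    for name, volume in lysis_volumes(25.0, nuclease_ul_per_mg=0.3).items():
        print(f"{name}: {volume:.2f} µL")
```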
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2.43 Preparation of frozen tumor tissue 

The technique has been described previously [3, 12].
Briefly, the sample is ground while frozen to a fine powder,
homogenized, lyophilized and solubilized in sample
buffer.

2.4.6 Control of representativity 

The tumors were examined routinely by experienced 
pathologists and smears or imprints from the samples 
were also assessed for cytometric DNA content by
microspectrophotometry.

2.5 2-D PAGE

2-D PAGE was performed as described [8, 10] except for
the following details. The glass tubes for IEF, 1.2 × 200
mm, contained 2.0% Resolyte, pH 4-8 (BDH) and were
cast to a height of 180 mm. A stock solution of acrylamide
(Serva) and N,N'-methylenebisacrylamide (16.7:1
for IEF and 37.5:1 for the second dimension) was deionized
by mixing with 5% w/v Duolite MB 5313 mixed-resin
ion exchanger (BDH) for 30 min, filtered (with a
0.22 µm nitrocellulose filter) and stored at -70°C.
N,N'-Methylenebisacrylamide, N,N,N',N'-tetramethylethylenediamine
(TEMED) and ammonium persulfate were
purchased from Bio-Rad. IEF tubes were prefocused at
200 V for 60 min. To each tube a sample corresponding to
20-40 µg protein was applied and focused for 14.5 h at
800 V and finally 1.0 h at 1000 V using a Protean II cell
(Bio-Rad) and Model 1000/500 Power Supply (Bio-Rad).
The tube gels were finally extruded into 1.25 mL equilibration
buffer, containing 60 mM Tris, pH 6.8 (2% SDS,
100 mM dithiothreitol and 10% glycerol), frozen on dry
ice and stored at -70°C. The acrylamide concentration of
the second dimension (1.0 × 180 × 90 mm) was 10%
T, and the gel contained 375 mM Tris, pH 8.8, and [illegible]
SDS. IEF gels were applied on top of the slab gel, fixed
with 0.5% agarose containing electrophoresis running
buffer (60 mM Tris-base, 0.2 M glycine and 0.1% SDS),
and electrophoresed with 10-1[illegible] mA per gel (constant
current) at +10°C. Six gels were run together in a Protean
II xi 2-D Multi-Cell (Bio-Rad). Proteins were visualized
by silver staining and photographed with the acidic
side to the left [13, 14].
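
Two quantities that follow directly from the running conditions above are the total volt-hours of the focusing program and the crosslinker fraction implied by each acrylamide:bisacrylamide ratio. The sketch below works both out under the usual definitions (volt-hours as the sum of voltage times duration, and %C as the bisacrylamide share of total monomer). The function names are assumptions for illustration.

```python
def volt_hours(steps):
    """Sum voltage x duration over (volts, hours) steps."""
    return sum(volts * hours for volts, hours in steps)


def percent_crosslinker(acrylamide_parts, bis_parts=1.0):
    """Return %C, the bisacrylamide share of total monomer by weight."""
    return 100.0 * bis_parts / (acrylamide_parts + bis_parts)


if __name__ == "__main__":
    prefocusing = [(200, 1.0)]               # 200 V for 60 min
    focusing = [(800, 14.5), (1000, 1.0)]    # 14.5 h at 800 V, then 1.0 h at 1000 V
    print("focusing Vh:", volt_hours(focusing))
    print("including prefocusing Vh:", volt_hours(prefocusing + focusing))
    print("IEF gel %C:", round(percent_crosslinker(16.7), 2))
    print("second dimension %C:", round(percent_crosslinker(37.5), 2))
```

Reporting volt-hours alongside the voltage program makes it easier to compare focusing runs between laboratories, which bears on the standardization problem raised earlier in this document.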

2.6 Identification of polypeptides 

Vimentin and vimentin-derived polypeptides were identified
by extraction of an MDA-231 cell lysate with 0.6 M
KCl/0.5% NP-40 [15]. Tropomyosins were extracted
from MDA-231 and WI38 cell lysates [16], and cytokeratins
were extracted from MDA-231 and MCF-7 cell
lysates [17]. The patterns were compared with published
maps [19-21]. Proliferating cell nuclear antigen (PCNA)
was identified by immunoblotting (PC10 mAb, Dakopatts)
using a semidry system (Multiphor II NovaBlot,
Pharmacia-LKB Biotechnology AB) and enhanced chemiluminescence
(ECL) detection (Amersham).

3 Results 

3.1 2-DE of samples prepared from normal and 
tumorigenic cultured cells 

The object of this study was to develop methods for pre- 
paration of 2-DE maps from human tumor tissue which 
have the same high resolution as those obtained from 
cultured cells. Shown in Fig. 2 are high resolution 2-DE 
gels prepared from cultured cells and one leukemia: 
SV40 transformed embryonal rat fibroblasts WT2 (Fig. 
2a); human MDA-231 breast carcinoma cells (Fig. 2b); 
human WI38 fibroblasts (Fig. 2c) and human pre B-ALL 
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cells (Fig. 2d). Polypeptides were identified through a 
laboratory exchange of cell samples/2-DE maps and 
through 2-DE analysis of purified proteins (Table 1). 

3.2 Preparation of samples from solid tumors 
3.2.1 Fresh versus frozen tissue 

An adenocarcinoma of the lung (LA) was prepared for
2-DE by conventional methods using frozen material
(Fig. 3a). There are several possible explanations for the poor
resolution using frozen tissue, including the presence of high
molecular weight protein aggregates. Filtering extracts
through 0.1 µm filters (Durapore, Millipore) resulted in
a slightly improved resolution (not shown). When fresh
tumor tissue from tumor LA was used for sample preparation,
using fine needle aspiration to collect the cells,
the resolution was considerably improved (Fig. 3b). The
use of fresh tissue resulted in a general increase in resolution,
which was most pronounced in the 50-100 kDa
molecular mass range. A number of differences in the
protein profiles of the gels in Figs. 3a and 3b can be observed,
some of which are indicated in the figures. The
decrease in serum albumin in Fig. 3b is likely to result
from loss of serum proteins occurring when cells were
pelleted after aspiration. Other differences, such as the
decreased level of transformation-sensitive tropomyosins
(TM1-TM3), may result from enrichment of tumor cells
in the sample of Fig. 3b. Fine needle aspiration, a well-established
technique in cytology, extracts mainly tumor
cells because of decreased intercellular adhesiveness of
neoplastic cells as compared to normal tissue. Microscopic
examination of Diff-Quick-stained extracted cells
from case LA revealed almost 100% tumor cells,
whereas the whole tissue extract contained approximately
60% tumor cells.
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T*M« 1. Names and abbreviations for identifie s »noi> 
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A Acuns 

aA      alpha-Actinin
B23     Protein B23/Numatrin
EF2     Elongation factor 2
EF1     Elongation factor 1 [illegible]
GT      Glutathione S-transferase (pi)
hsp60   Heat shock protein 60
hsp73   Heat shock protein 73
hsp80   Heat shock protein 80, GRP78, BiP
hsp90   Heat shock protein 90
hsp100  Heat shock protein 100, Endoplasmin
IFa     Intermediary filament associated
k8      Cytokeratin 8
LamB    Lamin B
Lip1    Lipocortin I
Lip2    Lipocortin II
Lip5    Lipocortin V
Mit1    Mitcon 1/beta-F1 ATPase
Mit2    Mitcon 2
Mit3    Mitcon 3
MRP     Mucin related polypeptides
pcna    Proliferating cell nuclear antigen
PLC     Phospholipase C (1)
RO      RO/SS-A antigen
SA      Serum albumin
aT      alpha-Tubulin
bT      beta-Tubulin
tm1     Non-muscle tropomyosin isoform 1
tm2     Non-muscle tropomyosin isoform 2
tm3     Non-muscle tropomyosin isoform 3
tm4     Non-muscle tropomyosin isoform 4
tm5     Non-muscle tropomyosin isoform 5
TPI     Triose phosphate isomerase
V       Vimentin
Vid1    Vimentin derived protein
Vid2    Vimentin derived protein
Vid3    Vimentin derived protein
Vid4    Vimentin derived protein
Vin     Vinculin
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122 Comparison or different methods for preparing 
cells from fresh tumor tissue 

Samples were prepared from breast and lung carcinomas
using either an enzymatic treatment with collagenase/
elastase or using nonenzymatic preparations (Fig. 4). A
number of differences in the protein profiles were observed
in the resulting 2-DE gels, some of which are
indicated in Figs. 4a and b. These differences include
both increases and decreases in spot intensity. These differences
may result from degradation of high molecular
weight polypeptides during enzymatic treatment, increased
solubilization of polypeptides, or may have other
causes. For many tumors, it was only possible to obtain
small amounts of material since they were reserved for
other examinations. In these cases, samples could be prepared
for 2-DE using either needle aspiration or
scraping. Figure 5a shows a 2-DE gel prepared from
squamous lung carcinoma (LS) cells collected by needle
aspiration and Fig. 5b shows a gel prepared from the
same tumor by scraping. In this case, a number of differences
were recorded between the two procedures, some
of which are arrowed in Fig. 5. Samples obtained from
other tumors (breast and lung) generally showed fewer
differences between these two methods of cell sampling
(not shown). These data show that different nonenzymatic
extraction procedures may yield different polypeptide
patterns. However, the number of spots with a large
difference in intensity was lower than when a nonenzymatic
preparation was compared with an enzymatic preparation.

Figure 5. 2-DE analysis of a squamous carcinoma of the lung (LS). Comparison of 2-DE gel quality and detected spots (arrowheads and circles) between
(A) aspirated (needle aspiration) and (B) scraped preparations from fresh tissue.

Figure 6. 2-DE analysis of other types of tumors: (A) hypernephroma, (B) adenoma of the thyroid and (C) corpus cancer, using the
nonenzymatic preparation technique. Arrowheads and circles indicate some cytosolic polypeptides.

Electrophoresis 1993, 14, 1045-1053

Preparation of human tumors for analysis by 2-D electrophoresis

2-DE maps of satisfactory quality were prepared by a
third procedure. Cells were released from small pieces of
tumor by squeezing (see Section 2). Some examples of
this are shown in Fig. 6, where 2-DE maps derived from
a case of hypernephroma, KH (Fig. 6a), a case of thyroid
tumor, TA (Fig. 6b) and a case of corpus cancer, CP (Fig.
6c) can be seen. We conclude that nonenzymatic techniques
are useful for 2-DE analysis of a number of different
tumors. The quality of the resulting gels is comparable
to that obtained using cultured cells (compare
the gels in Fig. 2 with those in Figs. 4, 6 and [illegible]). Which of
these methods will be optimal will, in our experience,
depend on the tumor material. For example, very small
tumors are preferably extracted by squeezing; on the
other hand, breast cancers (which are often fibrous)
yield satisfactory samples using scraping.

3.2.3 Purification of cells on Percoll gradients

We considered the possible advantage of separating
viable cells from dead cells, erythrocytes, and debris
using discontinuous Percoll gradients. Cells collected

Figure 7. 2-DE analysis of polypeptides from viable (b and d) and nonviable (a and c) cells of an adenocarcinoma of the lung (LB),
separated using discontinuous Percoll density gradient. Nonenzymatic preparation technique (a and b) and enzymatic preparation
technique (c and d) are compared.
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with these observations (Fig. 8). A number of potential 
and interesting markers, like tropomyosin isoforms, cytokeratins
and heat shock proteins, appear to be insensitive
to loss of viability during the preparation procedure.
We have to date made numerous observations of alterations
in the expression of these polypeptides in breast
cancers and lung cancers.

Another problem that may occur, irrespective of the sample
preparation techniques used, is admixture of lymphocytes.
These cases are easily detectable in smears and it
may therefore be possible to select lymphocyte-specific
spots as "internal markers" for the 2-D PAGE analysis.
Studies using this approach are in progress. Many of the
polypeptides identified are structural (Table 1). Since the
expression of many of these polypeptides is known to
vary between normal and malignant cells, the possibility
to determine their expression simultaneously is
appealing. In the specific case of breast cancer, alterations
in the expression of intermediate filament proteins
(cytokeratins) are known to occur during tumor progression
[23]. Other proteins known to be differentially
expressed between normal cells and transformed cells
are tropomyosins, numatrin/B23, heat shock proteins
and PCNA. To this end, we have observed alterations in
the expression of cytokeratin 8, hsp 90, and non-muscle
tropomyosin isoform 2 during malignant progression
(Okuzawa et al., in preparation and Franzén et al., in preparation).

The method of choice for sample preparation from
tumor tissues will depend on the properties of the tumor
material studied. It may be important to use only one
method when comparing cases within one group, as differences
were observed between methods. The advantages
of the nonenzymatic techniques are (i) that they minimize
contamination with connective tissue, (ii) that
problems with contamination of serum proteins are
avoided, and (iii) that separation of viable and dead cells
is not necessary. Hereby the resolving power of 2-D
PAGE is maximized for the analysis of human tumors
and studies on inter-tumor variations in gene expression
are facilitated. In addition, the polypeptide patterns obtained
may be more representative for the in vivo tumor
cell since the use of enzymes and incubations has been
minimized.

We would like to thank Dr. J. I. Garrels, Dr. S. Pattersson,
Dr. S. M. Hanash and Dr. J. E. Celis for making sample
and 2-DE map exchanges possible. This study was sup-

Reference points for comparisons of two-dimensional
maps of proteins from different human cell types
defined in a pH scale where isoelectric points correlate
with polypeptide compositions

A highly reproducible, commercial and nonlinear, wide-range immobilized pH 
gradient (IPG) was used to generate two-dimensional (2-D) gel maps of 
[35S]methionine-labeled proteins from noncultured, unfractionated normal
human epidermal keratinocytes. Forty-one proteins, common to most human
cell types and recorded in the human keratinocyte 2-D gel protein database,
were identified in the 2-D gel maps and their isoelectric points (pI) were
determined using narrow-range IPGs. The latter established a pH scale that
allowed comparisons between 2-D gel maps generated either with other IPGs
in the first dimension or with different human protein samples. Of the 41
proteins identified, a subset of 18 was defined as suitable to evaluate the
correlation between calculated and experimental pI values for polypeptides
with known composition. The variance calculated for the discrepancies between
calculated and experimental pI values for these proteins was 0.001 pH units.
Comparison of the values by the t-test for dependent samples (paired test)
gave a p-level of 0.49, indicating that there is no significant difference between
the calculated and experimental pI values. The precision of the calculated
values depended on the buffer capacity of the proteins and, on average, it
improved with increased buffer capacity. As shown here, the widely available
information on protein sequences cannot, a priori, be assumed to be sufficient
for calculating pI values because post-translational modifications, in particular
N-terminal blockage, pose a major problem. Of the 36 proteins analyzed in
this study, 18-20 were found to be N-terminally blocked and of these only 6
were indicated as such in databases. The probability of N-terminal blockage
depended on the nature of the N-terminal group. Twenty-six of the proteins
had either M, S or A as N-terminal amino acids and of these 17-19 were
blocked. Only 1 in 10 proteins containing other N-terminal groups was
blocked.



1 Introduction 

As compared with carrier ampholyte isoelectric focusing 
(CA-IEF), the application of immobilized pH gradients 
(IPGs) in the first dimension in 2-D gel electrophoresis 
offers improved reproducibility [1] because the nature of 
the pH gradient makes the resulting focusing positions 
insensitive to the focusing time [2] and to the type of 
sample applied [3]. The recently introduced ready-made 
IPG strips [4] seem to be an ideal substitute for the car- 
rier ampholyte gradients, which until now have been the 
most commonly used first dimensions in 2-D gel electro- 
phoresis. The availability of standardized first dimen- 
sions opens the possibility of comparing 2-D gel maps of 
various cell types generated in different laboratories, pro- 
vided that the focusing positions of a number of easily 
recognizable polypeptide spots common to the cell types
in question are known. Even though this approach is
limited to experiments performed with the same standar- 
dized IPG, the flexibility provided by IPGs allows the 
pH gradient to be adjusted to the requirements of a par- 
ticular experiment. 

Exchange and communication of 2-D gel protein data re- 
quires a pH scale that is independent of the particular 
IPG used and by which the results can be described. The 
introduction of carbamylation trains and the relation of 
focusing positions to the spots in these trains repre- 
sented a step forward towards solving the reproducibility 
problem experienced with carrier ampholyte focusing [5]. 
Problems associated with the use of carbamylation trains
were mainly due to lack of temperature control and to
the use of nonequilibrium focusing conditions. Accordingly,
the pattern variation involved not only the resulting
pH gradients, but also the relative spot positions
as related to each other and to spots in the carbamylation
trains. Even though the question of reproducibility
has, to a large extent, been solved, the carbamylation
trains are still not ideal as markers because the spots in
the trains do not represent defined entities but rather a
large number of differently carbamylated peptides
having close pI values. As a result, the spots are large
and poorly defined as compared to the ordinary polypeptide
spots in 2-D gel maps.

Neidhardt et al. [6] defined the pH gradient in 2-D gel
experiments by pI markers whose pI values were calculated
from the amino acid composition. Focusing positions
of other polypeptides could be predicted from their
composition but the pK values needed for the pI calculations
were unknown. Various groups employing this
approach do not use the same pK values [6, 7] and therefore
the pI values derived in this way cannot be
expected to describe the variation of the hydrogen ion
activity. In spite of this fact, it is still possible to make
approximate predictions of focusing positions because
the pK values used to define the pH gradient are also
used to calculate pI values and to predict the focusing
positions. Errors in pK assignments are therefore
compensated. A pH scale which correctly reflects the variation
in hydrogen ion activity during focusing should improve
the precision of the predictions, but this has never been
implemented with CA-IEF focusing as a first dimension
in 2-D gel electrophoresis. The main reason for this is
the problems associated with pH measurements in
focused gels containing high concentrations of urea.

IPGs can be described from the concentration variation 
of the immobilized groups, provided that the pK values 
of these groups are known for the conditions prevailing 
during focusing. To avoid measurements on gels, Gianazza
et al. [8] suggested the use of pK values derived by
addition of determined pK shifts. Recently, direct
determinations of pK differences between immobilized
groups in IPGs were made by determining pI-pK values
in overlapping narrow-range IPGs [9, 10] and the results
verified the applicability of the Gianazza approach. A
description of the focusing results in a pH scale, which
correctly describes the variation of the hydrogen ion
activity for the focusing conditions used, not only allows
the comparison of 2-D gel maps generated with different
IPGs, but also opens the possibility for correlating the
focusing position of a polypeptide with its composition
[9]. Experiments by Bjellqvist et al. [9, 10] have implied
that pH scales showing good correlation between calculated
and experimental pI values can be derived for any
of the conditions commonly used for focusing in connection
with 2-D gel electrophoresis. These pH scales are
then defined through the pK values of the immobilized
groups in the IPG-containing gel. To be useful for
interlaboratory comparisons, however, the pH scale has to be
defined through pI values of easily recognizable spots
present in the 2-D gel map. So far, pI determinations in
a useful pH scale, combined with determinations of pK
values needed for pI calculations, have only been made
for the pH range 4.5-6.5 at 10°C [9]. CA-IEF focusing as
described by O'Farrell [11] does not control the temperature
of the first dimension, which can be expected to be
slightly above room temperature. With IPGs, the temperature
commonly used is about 20°C [4, 12] or 25°C [13]
and this is a critical parameter that needs to be
controlled [14].

The present work was designed to compare 2-D gel maps 
of different cell types in a laboratory applying both 
CA-IEF and IPG focusing at a common temperature. To 
this end we have generated 2-D gel maps of proteins 
from noncultured, unfractionated normal human epidermal
keratinocytes with IPG in the first dimension
and a focusing temperature of 25°C. We have used
commercial nonlinear, wide-range IPG strips which give 2-D
gel maps that are closely similar to the ones resulting
with the CA-IEF technique used to establish the human
keratinocyte database [15]. As an initial step towards
interlaboratory comparisons of results obtained with the
nonlinear gradient as a first dimension we report here
on the focusing positions of 41 known proteins that are
common to most human cell types. The pH range
covered corresponds to the range in classical CA-IEF
2-D gel electrophoresis and in order to use these proteins
as internal standards for comparing 2-D gel maps
generated with other IPGs we determined their pI values
with narrow-range IPGs in the first dimension. We have
compared the calculated versus experimental pI values
and show that it is necessary to have further information
(absence or presence and nature of posttranslational
modifications), in addition to amino acid composition, to
be able to calculate pI values that correspond to the
actual experimental values. The pK values used for the
calculations are provided and the usefulness of pI
prediction in relation to database information is discussed.
Furthermore, we comment on the possibility of using
experimentally determined pI values to verify the
available database information on polypeptide composition.



2 Materials and methods 

2.1 Apparatus and chemicals 

Equipment for isoelectric focusing and horizontal SDS
electrophoresis (Multiphor II electrophoresis chamber,
Immobiline strip tray, Multidrive XL programmable
power supply, Macrodrive power supply and Multitemp
II) was from Pharmacia LKB Biotechnology AB
(Uppsala, Sweden). Vertical second-dimensional gels
were run in the home-made equipment described in [15].
The IPG strips with the wide-range nonlinear pH gradient
were either Immobiline DryStrip pH 3-10 NL,
180 mm, or alternatively 160 mm long IPG strips with a
corresponding pH gradient. In both cases the IPG strips
were delivered by Pharmacia LKB. Immobiline, Pharmalyte,
Ampholine, GelBond as well as PAG film and the
ready-made horizontal SDS gels (ExcelGel XL SDS
12-14) were also from Pharmacia LKB. Purified proteins
and peptides were from Sigma (St. Louis, MO).

2.2 Sample preparation 

Preparation and labeling of unfractionated keratinocytes 
as well as fibroblasts have been described in [16]. Cells 
were lysed in a solution containing 9.8 M urea, 2% w/v
NP-40, 100 mM DTT and 2% v/v Ampholine pH 7-9.

2.3 2-D gel electrophoresis 

First-dimensional focusing was performed according to
Görg et al. [2] with some minor modifications, as
described in [9]. Rehydration of the IPG strips was made
in a solution containing 9.8 M urea, 2% w/v CHAPS, 10
mM DTT and 2% v/v carrier ampholyte mixture. The
carrier ampholyte mixture consisted of 2 parts Pharmalyte
pH 4-6.5, 1 part Ampholine pH 6-8 and 1 part Pharmalyte
pH 8-10.5. Usually, cathodic sample application was
used and the samples were diluted 2-20 times in a
solution containing 9.8 M urea, 4% w/v CHAPS, 1% w/v
DTT and 35 mM Tris base. For acidic application, the
Tris base was substituted with 100 mM acetic acid. The
degree of dilution and sample volume (20-100 µL)
depended on the particular sample and the IPG, and
whether visualization of the proteins was to be done by
Coomassie Brilliant Blue or silver staining. With the
wide-range nonlinear IPG, 10-30 µg of total protein
was loaded for silver staining and 100-200 µg for
Coomassie staining. Focusing was done overnight with Vh
products in the range of 45-60 kVh with 160 mm long
strips and 50-70 kVh with 180 mm long strips.
Solubilization of polypeptides and blocking of -SH groups prior
to the second-dimensional run, as well as loading on the
second-dimensional gel, were done as described in [9].
The stacking gel was omitted and 5-10 mm were left at
the top of the second-dimensional gel for applying the
IPG strip. The space was filled with electrode buffer
containing 0.5% w/v agarose. Casting, running, staining and
autoradiography were carried out as described in [15].

2.4 Experimental determination of pI values

The determination of the pK differences between Immobilines
pK 4.6, pK 6.2 and pK 7.0 necessary for the calibration
of the pH scale at 25°C in 9.8 M urea was done
as described in [9] with the same narrow-range IPGs.
The pH scale was defined by setting the pK value of
Immobiline pK 4.6 equal to 4.61 [9] and the determined
pK differences gave the pK values of Immobilines pK 6.2
and pK 7.0 equal to 5.73 and 6.54, respectively. The pK
differences found are in good agreement with values
derived from [17] and [8] by extrapolation to 9.8 M urea
concentration. As in [9], additional narrow-range recipes
have been used for determining pI values. With
narrow-range IPGs extending to pH values higher than the pK
value of Immobiline pK 7.0, anodic sample application
was used with acetic acid added to the sample solution.
Otherwise, cathodic sample application was used with
the same sample buffer as for wide-range IPGs.



2.5 Protein compositions used for pI calculations

With the exception of vimentin, protein compositions
are from the Swiss-Prot database [18]. For vimentin, we
used the data from [19], where the amino acid at position
41 is a D instead of an S. Information in the
Swiss-Prot database on phosphorylation has been disregarded
because it was known from earlier studies (J. E. Celis,
unpublished results) that the spots in question
corresponded to the unphosphorylated forms of the peptides.

2.6 Calculation of pI values

For the pI calculations it was assumed that the same pK
value could be used for an amino acid residue in all
polypeptides and in all positions in the peptide except
for N- or C-terminally placed amino acids. For the pK
values of the N-terminal amino groups the effect of the
different substituents on the α-carbon was taken into
account. The calculations of pI values were made with
the aid of the IPG-maker program [20].

2.7 pK values used for pI calculations

For the carboxyl terminal group and internal glutamyl
and aspartyl residues the same pK values were used as in
[9]. For C-terminal glutamyl and aspartyl residues,
separate pK values were derived with the aid of the Taft
equations [9, 21]. The pK values of histidyl groups were
calculated from the pI values of human carbonic
anhydrase I as in [9]. For N-terminal glycine a pK value of
7.50 was used. The pK shift caused by a substituent on
the α-carbon was assumed to be identical with the pK
shift the substituent caused for the amino group in the
amino acid, i.e. 2.28 pH units were subtracted from the
pK values for the amino groups in the amino acids given
in [22, 23]. The approximate pK value of 9 for the
cysteinyl group was taken from [24]. For tyrosyl and arginyl
groups we used the pK values for the amino acids [22,
23]. For lysyl groups the effect of high urea concentration
on amino groups was taken into account and 0.5 pH
units were subtracted from the amino acid pK value.
These last three pK values are far from the pH range
under study and the results found would have been the
same if lysyl and arginyl groups were assumed to be
fully ionized while the ionization of tyrosyl groups was
neglected. A complete list of the pK values used is given
in Table 1.



Table 1. pK values used for the ionizable groups in peptides
(9.8 M urea, 25°C)

Ionizable group                    pK

C-terminal                         3.55

N-terminal
  Ala
  Met                              7.00
  Ser                              6.93
  Pro                              8.36
  Thr                              6.82
  Val                              7.44
  Glu                              7.70

Internal
  Asp                              4.05
  Glu                              4.45
  His                              5.98
  Cys                              9
  Tyr                              10
  Lys                              10
  Arg                              12

C-terminal side chain groups
  Asp                              4.55
  Glu                              4.75
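The calculation described in Sections 2.6 and 2.7 can be illustrated with a minimal sketch. It is not the IPG-maker program [20]; it only assumes that each ionizable group titrates independently according to its pK value (Henderson-Hasselbalch), sums the resulting charges, and locates the pH of zero net charge by bisection. The function names, the toy peptide and the fallback pK of 7.50 for N-terminal residues without a tabulated value are assumptions made for illustration.

```python
# pK values taken from Table 1 and Section 2.7 (9.8 M urea, 25 degrees C).
# The Ala entry is omitted; residues without an entry use the fallback.
PK_C_TERMINAL = 3.55
PK_N_TERMINAL = {"G": 7.50, "M": 7.00, "S": 6.93, "P": 8.36,
                 "T": 6.82, "V": 7.44, "E": 7.70}
PK_N_TERMINAL_FALLBACK = 7.50  # assumption for illustration, not from the paper
PK_ACIDIC = {"D": 4.05, "E": 4.45, "C": 9.0, "Y": 10.0}
PK_BASIC = {"H": 5.98, "K": 10.0, "R": 12.0}
PK_C_TERMINAL_SIDE_CHAIN = {"D": 4.55, "E": 4.75}


def net_charge(sequence, ph, n_blocked=False):
    """Net charge of an unfolded polypeptide at a given pH."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")
    # C-terminal carboxyl group (negative when deprotonated).
    charge = -1.0 / (1.0 + 10 ** (PK_C_TERMINAL - ph))
    # N-terminal amino group; it carries no charge when blocked.
    if not n_blocked:
        pk = PK_N_TERMINAL.get(seq[0], PK_N_TERMINAL_FALLBACK)
        charge += 1.0 / (1.0 + 10 ** (ph - pk))
    last = len(seq) - 1
    for i, aa in enumerate(seq):
        if aa in PK_ACIDIC:
            pk = PK_ACIDIC[aa]
            # Separate pK values apply to a C-terminal Asp or Glu side chain.
            if i == last and aa in PK_C_TERMINAL_SIDE_CHAIN:
                pk = PK_C_TERMINAL_SIDE_CHAIN[aa]
            charge -= 1.0 / (1.0 + 10 ** (pk - ph))
        elif aa in PK_BASIC:
            charge += 1.0 / (1.0 + 10 ** (ph - PK_BASIC[aa]))
    return charge


def isoelectric_point(sequence, n_blocked=False, tol=1e-4):
    """pH at which the net charge is zero, found by bisection."""
    lo, hi = 0.0, 14.0
    if net_charge(sequence, lo, n_blocked) <= 0 or \
            net_charge(sequence, hi, n_blocked) >= 0:
        raise ValueError("net charge does not change sign between pH 0 and 14")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if net_charge(sequence, mid, n_blocked) > 0:
            lo = mid   # still positive: the pI lies at higher pH
        else:
            hi = mid
    return (lo + hi) / 2.0


def buffer_capacity(sequence, ph, n_blocked=False, dph=0.01):
    """Charge units lost per pH unit, estimated by a central difference."""
    upper = net_charge(sequence, ph - dph, n_blocked)
    lower = net_charge(sequence, ph + dph, n_blocked)
    return (upper - lower) / (2.0 * dph)


toy = "MKDEH"  # hypothetical peptide, not one of the proteins studied
pi_free = isoelectric_point(toy)
pi_blocked = isoelectric_point(toy, n_blocked=True)
print(round(pi_free, 2), round(pi_blocked, 2))
```

Comparing the two results for the toy peptide shows why the status of the N-terminal group matters: removing the N-terminal charge shifts the calculated pI to a lower value, and the shift is largest for polypeptides with low buffer capacity.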



2.8 Statistical analysis 

Statistical comparisons of the experimental and calculated
pI values were done on an Apple Macintosh IIsi
using the statistical package Statistica/Mac, release 3.0b
(from StatSoft Inc., Tulsa, Oklahoma). Calculated and
experimental pI values were compared by the t-test for
correlated samples (paired t-test). The normality of pI
differences was estimated graphically by probability
plots. The variances of the data presented here and the
similar data on plasma and liver proteins in [9] were
compared by the F-test.
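As an illustration of the paired comparison, the following sketch computes the t statistic and its degrees of freedom from per-protein differences using only the Python standard library. It is not the Statistica/Mac procedure used in the study, the function name is hypothetical, and the four value pairs are invented for the example. Converting the statistic into a p-level additionally requires the Student t distribution, which is left to a statistics package.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(experimental, calculated):
    """Return (t, degrees of freedom) for a paired (dependent samples) t-test."""
    if len(experimental) != len(calculated):
        raise ValueError("both lists must have the same length")
    # The test works on the per-protein differences, not on the two lists.
    diffs = [e - c for e, c in zip(experimental, calculated)]
    n = len(diffs)
    if n < 2:
        raise ValueError("at least two pairs are required")
    sd = stdev(diffs)  # sample standard deviation (n - 1 in the denominator)
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1


# Hypothetical pI values for illustration only.
experimental_pi = [5.00, 6.10, 7.20, 4.80]
calculated_pi = [5.02, 6.08, 7.25, 4.79]
t_value, dof = paired_t_statistic(experimental_pi, calculated_pi)
print(round(t_value, 3), dof)
```

A t statistic close to zero relative to its degrees of freedom corresponds to a high p-level, i.e. no detectable systematic offset between calculated and experimental values.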

3 Results and discussion 

3.1 Identification of polypeptides and pI determinations

The 2-D gel maps of [35S]methionine-labeled proteins
from noncultured, unfractionated normal human
keratinocytes, focused with the nonlinear, wide-range IPG and
CA-IEF pH gradients in the first dimension, are shown
in Figs. 1 and 2, respectively. The IPG extends to higher
pH values but otherwise the two patterns are very similar
and most of the spots in the IPG pattern can be
directly related to the corresponding spots in the
CA-IEF gel. To obtain comparable patterns it was important
to keep the focusing temperature as similar as
possible. Compared to other studies [1-4, 9, 10, 12-14],
we increased the urea concentration in the focusing gel
to 9.8 M because keratins streaked badly in the focusing
dimension when 8 M urea was used, presumably due to
aggregates of acidic and basic keratins. An increase in
urea concentration to 9 M or more eliminated these
streaks; apart from this effect, no other major changes in
the focusing positions were observed. In Fig. 1 we have
indicated the positions of 41 known proteins from the
human keratinocyte 2-D gel database that are most
likely common to most human cell types. The choice
was made because these proteins are easy to identify
with certainty. With the exception of stratifin (spot 2),
involucrin (spot 4) and keratin 14 (spot 15), which are all
epithelial markers, these proteins are also present in
human fibroblasts (Fig. 3) and lymphocytes (results not
shown), and therefore can be used as landmarks for
comparing 2-D gel maps derived from different cell types. In
Table 2 the 41 proteins are listed together with their
sample spot numbers (SSP) in the human keratinocyte
protein database and pI values determined in 2-D gel
maps generated with narrow-range IPGs in the first
dimension.




Figure 1. 2-D gel protein map of [35S]methionine-labeled proteins from
noncultured, unfractionated normal human keratinocytes focused with
IPG in the first dimension. The position of the 41 proteins analyzed in
this study is indicated.

3.2 Comparison between the determined and calculated
pI values for human keratinocyte proteins

Thirty-six of the 41 proteins listed in Table 2 are found
in the Swiss-Prot database. Contrary to the plasma and
liver proteins used in [9], the pI calculations on the
proteins used in this study posed some problems that
reflected the way in which they were characterized. The
proteins used by Bjellqvist et al. [9] were either very
abundant and well-characterized plasma proteins or they
were identified by N-terminal sequencing and, therefore,
the nature of the N-terminals (acetylated or
non-acetylated) was in both cases known. The proteins used in
this study have all been characterized by internal
sequencing [7] and it is known that N-terminal
acetylation occurs with high frequency in eukaryotes.




536 B. Bjellqvist et al.

Electrophoresis 1994, 15, 529-539



According to Brown and Robert [25], proteins with
acetylated N-terminals correspond in weight to approximately
80% of the soluble protein in ascites cells. Based on
results from N-terminal sequencing, at least 40% of the
spots in the human liver protein 2-D gel map appear to
be blocked [3]. The corresponding number, derived from
107 spots in the 2-D gel map of human T-lymphocyte
proteins, falls between 60 and 65% (J. Strahler, personal
communication). Information concerning N-terminal
blockage is not normally available, and in the Swiss-Prot
database only 6 of the 36 keratinocyte proteins are
specified as N-terminally blocked. We have, within the present
material, defined 18 proteins for which the N-terminals
are very likely to be correctly described. Six of these
proteins are listed in the Swiss-Prot database as
N-terminally blocked, four represent proteins which appear in
the human liver 2-D gel map and have been
N-terminally sequenced as liver proteins [3] and the remaining
eight have N-terminal groups other than M, S and A, i.e.
N-terminals for which N-acetylation is uncommon [26].
In Figs. 4A, B, C and D pI values calculated from
Swiss-Prot database information are plotted against the
experimentally determined pI values for all the keratinocyte
proteins listed in Table 2 and for the 18 selected
proteins, as well as for the plasma and liver proteins (data
from [9] valid for 10°C)*.

The calculations show that without knowledge of the
status of the N-terminal group, precise predictions of pI
values for eukaryotic proteins cannot be achieved based
on the information available in Swiss-Prot and similar
databases. However, for proteins where the N-terminal
status is known, we find good correlation between
predicted and experimental pI values. When the variance of
the pI discrepancies and the variance of calculated
charges at the experimental pI values derived from the
present data set are compared with the corresponding


* There are four plots: (A) the 36 polypeptides from normal human
keratinocytes (no corrections), (B) the 36 polypeptides from Fig. 4A
where pI values have been recalculated for 12 polypeptides with M,
S and A as N-terminal, assumed to be blocked based on calculated
charge, (C) the 18 selected polypeptides with information on the
N-terminal configuration, and (D) plasma and liver proteins.



Figure 4. Calculated vs. experimental pI values. Lines are fitted using the least squares criterion. (A) 36 polypeptides from normal human
keratinocytes (no corrections). (B) 36 polypeptides from Fig. 4A (including the 18 marker polypeptides) where pI values have been recalculated
assuming N-terminal blockage; x indicates recalculated pI values; nucleolar protein B23 is indicated with an arrow. (C) 18 polypeptides with
information on N-terminal configuration and (D) plasma and liver proteins.






values derived from the data on plasma and liver proteins
in [9] (Table 3), the present data are found to result
in larger variances for the values of both pI discrepancies
and calculated charge at the experimental pI value when
no information on posttranslational modification is
taken into consideration. Correction for possible
N-acetylation of 12 polypeptides with M, S and A as N-terminal
results in a smaller variance of pI discrepancies,
although not significantly different from values derived
from [9], whereas the variance of the calculated charge at
the experimental pI value is significantly higher. For the
18 selected proteins the variance for the pI discrepancies
is significantly smaller than for the data in [9]; however,
the corresponding value for calculated charge at the
experimental pI value does not improve to the same
extent. This, we believe, reflects another difference
between the two sets of proteins used for the
calculations. Based on spot distributions in 2-D gel maps, the
set of proteins used here has a molecular weight
distribution that is more representative of the patterns
observed in mammalian cells. In the study by Bjellqvist
et al. [9] most of the high molecular weight plasma
proteins had to be excluded due to their unknown content
of sialic acid, which made the proteins analyzed in that
study heavily biased towards low molecular weight
proteins. The buffer capacity of proteins normally increases
with the protein's molecular weight, and the average
buffer capacity of the presently selected proteins with
assumed known N-terminals is 18 charge units/pH unit,
while the corresponding value for the proteins used in
[9] is only 9 charge units/pH unit. High buffer capacity
can be expected to improve the agreement between
calculated and experimental pI values. Inspection of the
data presented in Table 2 for the polypeptides with
assumed known N-terminals verifies the importance of
the buffer capacity. For 8 polypeptides having buffer
capacities higher than 15 charge units/pH unit, the
calculations in all cases yielded pI discrepancies with absolute
values of less than 0.02 pH units. The largest
discrepancy, 0.06 pH units, was observed for annexin II and
stathmin, proteins which have low buffer capacity: 0.9
and 6.6 charge units/pH unit, respectively. The
probability that the focusing position of a protein with known
composition will fall within a certain distance from the
calculated pI value therefore cannot be predicted by the
variance alone. The buffer capacity of the specific protein
must be taken into consideration as well. As indicated
by the decrease of the variance of calculated charges at
the experimental pI value for the selected proteins, the
observed improvement cannot solely be due to the
higher buffer capacity of the keratinocyte proteins. The
two studies relate to different experimental conditions.
Good agreement between experimental and calculated
pI values implies that the proteins are unfolded and a
factor that may contribute to the observed improvement
is a more complete unfolding of proteins caused by the
higher temperature and urea concentration used in this
study.
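The variance comparison against the data in [9] follows the definition given in the footnote to Table 3: the larger variance is divided by the smaller one. A minimal sketch of that ratio, using the rounded variances and protein counts printed in Table 3, is given below. The function name is hypothetical, and the P-level, which requires the F distribution, is not computed here.

```python
def variance_ratio(var_a, n_a, var_b, n_b):
    """Return (F, df_numerator, df_denominator), larger variance on top."""
    if var_a <= 0 or var_b <= 0:
        raise ValueError("variances must be positive")
    if n_a < 2 or n_b < 2:
        raise ValueError("each sample needs at least two values")
    if var_a >= var_b:
        return var_a / var_b, n_a - 1, n_b - 1
    return var_b / var_a, n_b - 1, n_a - 1


# pI discrepancy variances as printed in Table 3 (rounded values).
plasma_liver = (0.005, 29)          # data from [9]
all_peptides = (0.017, 36)          # keratinocyte proteins, no correction
after_correction = (0.003, 36)      # after correction for N-acetylation

f_all = variance_ratio(*all_peptides, *plasma_liver)
f_corr = variance_ratio(*after_correction, *plasma_liver)
print(round(f_all[0], 2), f_all[1:], round(f_corr[0], 2), f_corr[1:])
```

With these rounded inputs the ratios reproduce the F-values of 3.4 and 1.67 listed in Table 3; small deviations for other entries are to be expected because the printed variances are rounded.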

The data indicate that the precision with which pI
values can be predicted for polypeptides with high buffer
capacity is better than the precision with which
experimental pI values can be determined. If the pH is defined
through the pK values of the immobilized groups in the
IPG-containing gel, the precision of the experimentally
determined data will depend on the pH difference
between the pI and the pK value of the immobilized
group with the closest pK. For the present study this will
give pI determinations with a precision varying in the
range of ± 0.02-0.05 pH units [9]. The good agreement
observed between the calculated and experimental pI
values is due to the fact that errors are mainly
systematic and, as discussed in [9], they will largely be cancelled
out in the calculations. A pH scale defined through the
presently determined pI values will not necessarily
reflect the variation of the hydrogen ion activity during
the focusing step in an optimal way, but it still allows
precise predictions of focusing positions for polypeptides
with known compositions, including information on
posttranslational modifications. Calculated net charge at
the experimentally found isoelectric point defined in this
scale will serve as a tool to verify that the polypeptide



Table 3. Mean values and variances for the difference (experimental pI -
calculated pI) in pH units and for the calculated charges at the
experimental pI values, respectively

Columns:
  (1) Plasma and liver proteins (8 M urea, 10°C)
  (2) Keratinocyte proteins (9.8 M urea, 25°C): all peptides
  (3) Keratinocyte proteins (9.8 M urea, 25°C): all peptides after
      correction for N-acetylation
  (4) Keratinocyte proteins (9.8 M urea, 25°C): known N-terminal
      configuration (or very likely configuration)

                                         (1)      (2)      (3)      (4)
Number of proteins                       29       36       36       18

Experimental pI - calculated pI
  Mean                                  -0.011    0.072    0.019    0.005
  Variance                               0.005    0.017    0.003    0.001
  F-value (pI discrepancy) a)                     3.4      1.67     3
  P-level (pI discrepancy) b)                     0.0005   0.0721   0.0004

Calculated charge at the
experimental pI value
  Mean                                  -0.070    0.321    0.009   -0.014
  Variance                               0.227    0.871    0.444    0.109
  F-value (calculated charge at
  the experimental pI value) a)                   3.8      1.96     2.08
  P-level (calculated charge at
  the experimental pI value) b)                   0.0002   0.0338   0.0536

a) Comparison to the data in [9]. F = S₁²/S₂², where S₁² is the larger of the two variances
b) P(F(ν₁, ν₂) > F-value), where ν₁ and ν₂ are the degrees of freedom for S₁² and S₂², respectively







composition used in the calculation is correct and
complete. Exceptions to this are proteins such as involucrin
and heat shock protein 90 that have very high buffer
capacities. Introduction of an extra charge unit into
these proteins will only result in pI shifts falling in the
range of 0.01-0.02 pH units. In these cases the
quality of the pH definition (the precision with which the pK
values used in the calculations are given and the
precision of the experimental pI values) will limit
the possibilities to verify polypeptide composition based
on the experimental pI value.
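The relation between an extra charge unit, the buffer capacity and the resulting pI shift can be made explicit with a small worked example. The sketch below assumes that, close to the pI, the titration curve is linear, so that the shift equals the charge change divided by the buffer capacity. The function names are hypothetical, and the buffer capacities of 50 and 100 charge units/pH unit are illustrative values chosen to match the quoted shift range, not values reported in the paper.

```python
def pi_shift(delta_charge, buffer_capacity):
    """Approximate pI shift (pH units) caused by a change in net charge.

    buffer_capacity is given in charge units per pH unit and is assumed
    to be constant over the small pH interval considered.
    """
    if buffer_capacity <= 0:
        raise ValueError("buffer capacity must be positive")
    return delta_charge / buffer_capacity


def implied_buffer_capacity(delta_charge, delta_pi):
    """Buffer capacity implied by a charge difference and a pI difference."""
    if delta_pi == 0:
        raise ValueError("pI difference must be non-zero")
    return delta_charge / delta_pi


# One extra charge unit on a protein with very high buffer capacity.
print(pi_shift(1, 50), pi_shift(1, 100))
# One extra charge unit at the average buffer capacities quoted in the
# text for the selected keratinocyte proteins (18) and the proteins in [9] (9).
print(round(pi_shift(1, 18), 3), round(pi_shift(1, 9), 3))
# A discrepancy of 0.11 pH units corresponding to 3 charge units implies a
# buffer capacity of roughly 27 charge units/pH unit.
print(round(implied_buffer_capacity(3, 0.11), 1))
```

The same arithmetic shows why a missing or extra charge is easy to detect for proteins with low buffer capacity but falls within the experimental precision of ± 0.02-0.05 pH units for proteins with very high buffer capacity.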

Statistical comparison of experimental and calculated pI
values was done using the t-test for dependent samples
and normality of the discrepancies was estimated by
probability plots. For the 36 proteins, the p-level is
0.0021, indicating that a result like this is unlikely to
be a chance effect and must be assumed to represent a
real difference. After correction for the most likely
N-terminal configuration, the p-level is 0.043; the two
sets of values still cannot be accepted as representing the
same population since the p-level is less than 0.05, the
traditional limit of statistical significance. For the 18
proteins with a known or very likely N-terminal
configuration the t-test gave a p-level of 0.49, which indicates
that the experimental and calculated pI values are not
significantly different.

Besides showing that p/ values for denatured proteins 
with known compositions can be calculated with a high 
degree of precision from average pA' values, the results 
also provide strong support for the notion that 
A-terminal blockage heavily depends on the nature of 
the A-terminal groups [26]. The results seem to indicate 
that with A'-terminals other than M. S and A, only a few 
proteins have blocked A-terminals (1 out of 10 proteins 
in the present study), while it can be inferred from the 
data presented in Table 2 that a majority of the proteins 
with M. S and A as A-terminal are blocked. After correc- 
tion for the effect of suspected A-terminal blockage 
there is only one protein (nucleolar protein B23) out of 
the 36 used in this study, which, in spite of a high buffer 
capacity, has a marked difference of 0.11 pH units 
between predicted and determined p/ values (Fig. 4B); 
this corresponds to 3 charge units due to the high buffer 
capacity of this protein. This discrepancy in p/ prediction 
and calculation of net charge at the p/ is probably not 
due to deficiencies in the database information but 
instead reflects a shortcoming of the model used for p/ 
calculations. Nucleolar protein B23 contains a domain 
extremely rich in aspartic and glutamic acid residues 
(Table 4), in which 26 out of 28 amino acid residues 
from position 161 to 188 are either a D or an E. A calcu- 
lation based on the use of average pK values unin- 
fluenced by the charged neighboring amino acid resi- 
dues cannot be expected to correctly describe the pi 
value with almost half of the acidic groups packed 



Table 4. Amino acid sequence of nucleolar phosphoprotein B23 
together into a highly negatively charged region. This
limitation caused by calculations based on average
values does not severely limit the usefulness of the
approach since a search through Swiss-Prot shows that
this type of D/E-rich motif is uncommon, and the
existence of a highly charged region is immediately
apparent upon inspection of the amino acid sequence.
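
The inspection step described above can be approximated with a simple sliding-window scan. The following is a minimal sketch, not the authors' procedure: the function name, the default window of 28 residues and the threshold of 26 acidic residues are assumptions taken from the B23 example in the text, and the test sequence is a made-up toy string, not the B23 sequence.

```python
def acidic_windows(sequence, window=28, min_acidic=26):
    """Return (start, end, count) for each window holding >= min_acidic D/E residues.

    Positions are 1-based and inclusive, matching how the text counts residues.
    """
    seq = sequence.upper()
    hits = []
    for start in range(0, len(seq) - window + 1):
        segment = seq[start:start + window]
        count = segment.count("D") + segment.count("E")
        if count >= min_acidic:
            hits.append((start + 1, start + window, count))
    return hits


# Hypothetical toy sequence (NOT nucleolar protein B23): ten neutral residues,
# a 28-residue stretch containing 26 acidic residues, then ten neutral residues.
toy = "G" * 10 + "DE" * 6 + "A" + "ED" * 7 + "S" + "G" * 10

for start, end, count in acidic_windows(toy):
    print(f"residues {start}-{end}: {count} of {end - start + 1} are D or E")
```

A sequence that produces any hit at such a strict threshold is one for which a pI computed from average pK values should be treated with caution.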

The quality of the information available in databases,
especially concerning posttranslational modifications, is
a major problem when the data is to be used for pI
predictions. The p-level of 0.043 found for all 36 proteins
after correction for N-acetylation shows that this
problem is not limited to N-terminal blockage, and the
very good agreement found for the eighteen polypeptides
with presumably correctly described N-terminals
(Fig. 4C) must be regarded as an exception from this
point of view. N-terminal blockage is generally the main
problem in relation to pI predictions for eukaryotic
proteins. Of the 36 keratinocyte proteins analyzed, 18-20
are suspected to be N-terminally blocked (6 proteins
blocked according to Swiss-Prot, 12 proteins with M, S or A
as N-terminal and presumably blocked based on the
calculated charge, and two proteins, involucrin and
nucleolar protein B23, with M as N-terminal for which
the data does not allow any conclusion). This is in
reasonable agreement with the conclusions based on the
N-terminal sequencing data derived in connection with
2-D gel electrophoresis. N-terminal blockage can be
suspected for 17-19 of the 26 proteins with M, S or A as
N-terminal, while only 1 in 10 proteins with other
N-terminal groups are blocked. The information that the
frequency of N-terminal blockage is strongly related to
the nature of the N-terminal group will be of some help
in connection with pI predictions based on database
information. However, without information from other
sources, an uncertainty will always remain as to whether
the N-terminal charge should be included in the pI
calculation.
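
To illustrate why the N-terminal question matters, the sketch below computes a pI from composition with and without the N-terminal charge. It is a minimal illustration under stated assumptions: the pK values are generic textbook-style averages chosen for the example (not the values used in this study), the function names are hypothetical, and the peptide is invented.

```python
# Assumed average pK values for illustration only (not the study's values).
PK_POSITIVE = {"K": 10.0, "R": 12.0, "H": 6.0}
PK_NEGATIVE = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}
PK_N_TERM = 8.0  # assumed
PK_C_TERM = 3.6  # assumed


def net_charge(sequence, ph, n_term_blocked=False):
    """Net charge at a given pH from the Henderson-Hasselbalch relation."""
    charge = 0.0
    for residue, pk in PK_POSITIVE.items():
        charge += sequence.count(residue) / (1.0 + 10 ** (ph - pk))
    for residue, pk in PK_NEGATIVE.items():
        charge -= sequence.count(residue) / (1.0 + 10 ** (pk - ph))
    charge -= 1.0 / (1.0 + 10 ** (PK_C_TERM - ph))  # free C-terminal carboxyl
    if not n_term_blocked:
        charge += 1.0 / (1.0 + 10 ** (ph - PK_N_TERM))  # free N-terminal amine
    return charge


def isoelectric_point(sequence, n_term_blocked=False, tol=1e-4):
    """Bisect for the pH where the net charge crosses zero."""
    low, high = 0.0, 14.0
    while high - low > tol:
        mid = (low + high) / 2.0
        if net_charge(sequence, mid, n_term_blocked) > 0:
            low = mid   # still positive: the pI lies at higher pH
        else:
            high = mid
    return (low + high) / 2.0


peptide = "MSAKDEGH"  # hypothetical peptide, for illustration only
print("free N-terminal:   ", round(isoelectric_point(peptide), 2))
print("blocked N-terminal:", round(isoelectric_point(peptide, True), 2))
```

Removing the N-terminal positive charge always shifts the computed pI toward lower pH; the size of the shift depends on the buffer capacity of the protein near its pI, which is why a wrong assumption about blockage is most visible for proteins with low buffer capacity.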



4 Concluding remarks 

The data presented here lays the foundation for
comparing 2-D gel protein maps of different cell types
generated with nonlinear, wide-range IPGs in the first
dimension. The focusing positions of 41 polypeptides common
to most human cell types have been described in a pH
scale that allows focusing positions to be predicted with
a high degree of accuracy, provided that the composition
of the polypeptides is known and that information on
posttranslational modifications is available. For
polypeptides with a very high buffer capacity, the limiting
factor is the precision with which experimental pH
values can be determined rather than the precision of
the calculations. Possible deficiencies in the pH scale
description of the variation of the hydrogen ion activity
have, at least at the present state, no consequences for its
practical use. The major limitation in connection with
predictions of focusing positions from polypeptide
compositions is the quality of existing data on protein
compositions, especially concerning posttranslational
modifications. Amino acid sequences have been reasonably
easy to obtain, while posttranslational modifications
have been difficult and work-intensive to determine.
Recent developments in the field of mass spectrometry
are fast changing this situation and within the next years
we can expect a surge in reliable data in this area. While
awaiting this development, verification of correctness
and completeness of available information on
polypeptide composition can be provided by experimental pI
values in a pH scale based on the pI values determined
in this study. So far, our data cover the pH range below
pH = 7.5. The basic pH range covered by NEPHGE as
first dimension will be covered in forthcoming work.

Received December 29, 1993
Large Scale Biology Corporation

Large Scale Biology Corporation is the leader in the integrated discovery, production 
and application of proteins - the functional units of all biological processes. 

Large Scale Biology Corporation (LSB, Vacaville, CA) and its subsidiary Large Scale 
Proteomics Corp. (LSP, Germantown, MD) are a biotechnology enterprise with the mission of 
accelerating the speed and productivity of the life sciences industry product discovery and 
development programs. Unique among biotechnology companies is LSB's integration of 
technologies to discover, analyze, manufacture and find new applications for proteins - the 
functional units of all biological processes. 

Genomics companies have focused on deciphering genetic information, providing an initial but
only partial understanding of biological processes. LSB's proprietary protein technologies can
enable the transformation of genomic information into products such as drug targets,
therapeutics, diagnostics for drug efficacy and toxicity, and traits for agricultural crops. Large
Scale Biology has gone beyond the "genomics" realm in its business model and developed
ways to integrate the discovery of gene function with quantitative protein analysis and protein
manufacturing. This integration of technology platforms favorably positions LSB as a leading
provider of valuable content to industry leaders in the fields of diagnostics, therapeutics,
vaccines and agribusiness.

LSB was founded in 1987 with the goal of commercializing its proprietary GENEWARE viral
vector system - a novel technology for gene expression. Using safe RNA viruses to transiently
express genes in non-recombinant plants, LSB has positioned itself in the industry to provide
cost-effective manufacturing and purification of diverse protein and peptide products. The
same technology can be applied to the expression of libraries of foreign genes in an
automated, high-throughput format to discover the function of genes with unparalleled
efficiency. The GENEWARE system and associated proprietary technologies form the basis
for LSB's functional genomics, biomanufacturing and a variety of proprietary products under
development.

From its foundation, LSB understood the need to integrate functional genomic and protein
manufacturing expertise with quantitative protein analysis and informatics to become a
world-leader in the protein field. In 1999, LSB acquired a privately held pharmaceutical
proteomics company originally founded in 1985. Large Scale Proteomics Corporation (a wholly
owned subsidiary of Large Scale Biology Corporation) is an industry leader in identifying and
characterizing proteins in all types of biological samples for the discovery and development of
new and more effective therapies, diagnostics, and agricultural products.

"Proteomics" is the study of the entire complement of proteins expressed in a cell, tissue, or 
organism. Proteomics can significantly improve drug discovery and development because 
most illness is associated with imbalances among, or malfunctions of, proteins. Only a small 
fraction of diseases can be attributed to the presence of a defective gene. Unlike classical 
genomics approaches that discover genes that may relate to a disease, LSP has developed a 
proprietary system called the ProGEx module for directly characterizing proteins associated 
with disease. Using this same technology, LSP can characterize the effects of candidate drugs 
intended to reverse a disease process, and to determine the degree to which this objective is 
achieved free of adverse side effects. 

LSB and LSP have protected their many discoveries through an extensive portfolio of domestic
and foreign patents and have developed commercial alliances and partnerships to exploit the 
value of their technologies. LSB and LSP scientists and engineers focus on the development 
and application of resources to help clients meet their objectives as well as the development of 
our own proprietary products for subsequent partnering with industry leaders. 

A combined staff of 140 professionals operates from three locations in the United States with 
a network of collaborators and affiliates throughout the US and Europe. Company 
headquarters, R&D laboratories and its Genomics division are located in Vacaville, California 
about 60 miles northeast of San Francisco. Process development and biomanufacturing take 
place in Owensboro, Kentucky, and LSB's Large Scale Proteomics Corporation subsidiary is 
located in Germantown, Maryland. 

In August, 2000, LSB completed an initial public offering (IPO) of 5 million shares of common 
stock and now trades on the NASDAQ under the symbol LSBC. 

Leadership - Large Scale Biology Corporation 

Robert L. Erwin, Chairman of the Board and Chief Executive Officer, founded LSB™ and has 
served as a director and officer since 1987. Mr. Erwin is the former chairman of the State of 
California Breast Cancer Research Council and currently serves on the University of California 
President's Engineering Advisory Council. He is Chairman of the Supervisory Board of Icon 
Genetics AG. As a co-founder of Sungene Technologies Corp., Mr. Erwin served as Vice 
President of Research and Product Development from 1981 through 1986. He has served on 
the Biotechnology Industry Advisory Board for Iowa State University. Mr. Erwin received his 
M.S. degree in Genetics from Louisiana State University and is an inventor on several LSB 
patents. 

David R. McGee, Ph.D., a co-founder of LSB and Senior Vice President and Chief Operating
Officer, has been an officer since 1987. Prior to joining LSB, Dr. McGee was Vice President of
Operations at Sungene Technologies Corporation from 1983 to 1987. Dr. McGee received his
Ph.D. in Genetics from Louisiana State University and served as a faculty instructor of zoology
and genetics at Louisiana State University.

Laurence K. Grill, Ph.D., a co-founder of LSB and Senior Vice President, Research and
Development, has served as an officer since 1987. Dr. Grill was the Manager of Plant
Molecular Biology for Sandoz Crop Protection Corp. from 1984 to 1987 and Senior Research
Scientist in the Department of Molecular Biology at Zoecon Research Institute from 1980 to
1984. He received his Ph.D. from the University of California at Riverside with an emphasis on
the molecular basis for viral gene expression in plants.

R. Barry Holtz, Ph.D., Senior Vice President, Biopharmaceutical Manufacturing, has served
the company as an officer since 1989 upon the acquisition of Holtz Bio-Engineering, which 
was founded in 1980. Dr. Holtz was a co-founder and Director of Research for MFI, Inc., the 
largest manufacturer of microencapsulated nutrients for agriculture and Director of 
Fundamental Research at Foremost-McKesson, Inc. Dr. Holtz received his Ph.D. in 
Biochemistry from Pennsylvania State University and served as Assistant Professor in the 
Department of Food Science and Nutrition at Ohio State University. 

Daniel Tuse, Ph.D., has been an officer of LSB since he joined the Company in 1995 as Vice 
President, Pharmaceutical Development. Dr. Tuse manages the company's pharmaceutical 
design and development programs, including LSB's novel vaccines and immunotherapeutics 
initiatives. Prior to joining LSB, Dr. Tuse was Assistant Director of SRI International's (Menlo 
Park, Calif.) Life Sciences Division. In his 17 years at SRI, Dr. Tuse developed extensive R&D 
experience in pharmaceuticals and specialty chemicals, serving an international list of clients. 
Dr. Tuse received his Ph.D. in Microbiology (1980, cum laude) with a minor in Toxicology from 
the University of California, Davis. 

John S. Rakitan, a co-founder of LSB, Senior Vice President & General Counsel and 
Secretary, has served as an officer since 1988. Prior to joining LSB, Mr. Rakitan was an 
attorney in private practice. Mr. Rakitan received his J.D. degree from the University of Notre 
Dame. 

Michael D. Centron, Treasurer, has served as Controller since 1988 and was elected as
Treasurer in 1991. Mr. Centron was Audit Supervisor for Varian Associates from June 1985
through July 1988, and he also worked for Arthur Young and Co. (currently Ernst & Young). 
Mr. Centron is a certified public accountant and received his M.B.A. degree from the University 
of California at Berkeley. 

Guy della-Cioppa, Ph.D., is an officer of the company and currently serves as Vice President, 
Genomics. Prior to joining the company in 1989, Dr. della-Cioppa worked for Monsanto
Company in St. Louis, MO from 1984-1989 and was an NIH Postdoctoral Fellow at the 
Worcester Foundation for Experimental Biology in Shrewsbury, MA from 1983-1984. He 
received his Ph.D. in Biology from the University of California, Los Angeles. 

William M. Pfann joined Large Scale Biology in August 2000 as Senior Vice President, Finance
and Chief Financial Officer. Mr. Pfann was formerly with PricewaterhouseCoopers LLP from
1969 to July 2000, most recently as the Risk Management Partner for the Western Region. He
served in a number of management roles at PwC, including leader of the firm's Silicon Valley 
audit practice, National Director of the networking and communications sector and Managing 
Partner of the Northern California emerging business group, as well as Partner-in-Charge of 
the Oakland and Walnut Creek, California offices. Mr. Pfann received a B.S. degree from the 
University of California, Berkeley, in Business Administration and an MBA in Accounting from 
Golden Gate University. 


© 2000 Large Scale Biology Corporation. All Rights Reserved Worldwide. 
Large Scale Proteomics Corporation
Leadership - Large Scale Proteomics Corporation 

N. Leigh Anderson, Ph.D., Chairman, President and CEO of Large Scale Proteomics
Corporation (LSP™). Dr. Anderson obtained his B.A. in Physics with honors from Yale and a 
Ph.D. in Molecular Biology from Cambridge University (England) working with M. F. Perutz as 
a Churchill Fellow at the MRC Laboratory of Molecular Biology. Subsequently he co-founded 
the Molecular Anatomy Program at the Argonne National Laboratory (Chicago) where his 
work in the development of 2-dimensional electrophoresis (2-DE) and molecular database 
technology earned him, among other distinctions, the American Association for Clinical 
Chemistry's Young Investigator Award for 1982 and the 1983 Pittsburgh Analytical Chemistry 
Award. In 1985 Dr. Anderson co-founded LSP (originally Large Scale Biology Corp., 
Germantown, MD) in order to pursue commercial development and large-scale applications 
of 2-D electrophoretic protein mapping technology. 

Norman G. Anderson, Ph.D., Chief Scientist at LSP. Dr. Anderson has a distinguished record 
as an inventor. His career includes senior positions at Oak Ridge and Argonne National 
Laboratories (ORNL and ANL), more than 300 scientific publications, and the receipt of more 
than 20 prestigious awards in recognition of his work in science and technology. For his 
invention of the zonal ultracentrifuge, he received the John Scott Medal Award, and for the 
centrifugal fast analyzer, the Preis Biochemische Analytik fur Klinische Chemie from Die 
Deutsche Gesellschaft fur Klinische Chemie for the most outstanding analytical development 
in clinical chemistry worldwide during a 2-year period. In 1984 ANL awarded him its career 
patent leader award for the largest number of patents issued to an employee. At that time the 
commercial value of his inventions in terms of U.S. sales and royalties from foreign licensing 
were $250 million and $1 million, respectively. Dr. Anderson received his degrees at Duke 
University: a B.A. in Zoology, M.A. in Physiology, and Ph.D. in Cell Physiology. He holds 28 
patents. 

Constance Seniff, Vice President, Operations. Ms. Seniff has managed LSP's operations
since 1993. Her background includes thirteen years in international business prior to joining 
LSP, five abroad in the employ of foreign firms. Ms. Seniff is responsible for helping 
formulate and implement business development and database commercialization strategies 
for LSP in coordination with the management of LSP's parent company, Large Scale Biology 
Corporation. Ms. Seniff has a B.Sc. degree in Business (with honors) from Florida State 
University. 

Robert J. Walden, Vice President, Finance at LSP. Mr. Walden joined LSP in 1997 and has 
served as a director since 1999. He previously served as Vice President of Finance and 
Administration at Osiris Therapeutics, Inc., and as Chief Financial Officer at the American 
Type Culture Collection (ATCC). Mr. Walden received his degree in Finance from the 
University of Maryland. 

Jean-Paul Hofmann, Ph.D., Vice President, Software Development at LSP. Dr. Hofmann is a
plant geneticist by training, having earned a B.S. in Biology, M.S. in Biochemistry and
Genetics, and Ph.D. in Plant Genetics from the University of Orsay, Paris. He has extensive
experience in using 2-DE in agronomic research and in designing analytical software for 1-
and 2-D applications. He has held senior scientific positions in industry and research
institutes, in the U.S., France and the Ivory Coast.

John Taylor, Ph.D., Vice President, Software Development and Bioinformatics. Dr. Taylor is
the principal developer of Kepler™, LSP's analytical software for automated 2-DE pattern
analysis. Prior to joining LSB, Dr. Taylor served as computer scientist in the Molecular
Anatomy Program at Argonne, and on the research staffs of the University of Chicago and
the Armed Forces Institute of Pathology in Washington, D.C. Dr. Taylor received a B.S. in
Physics from the University of South Carolina, and a Ph.D. in Nuclear Physics from Duke 
University. 

Sandra Steiner, Ph.D., currently serves as Vice President, Proteomics Applications. Prior to
joining the Company, Dr. Steiner founded and directed the Molecular Toxicology Group at
Novartis in Basel, Switzerland and was a member in several multi-disciplinary drug
development project teams. Dr. Steiner received her Ph.D. in Toxicology/Pharmacology from
the University of Basel, Switzerland.

Confidential - Property of Incyte Genomics, Inc. SeqServer



Program: blastp 
Sequence ID(s):

PF-0232-1DIV_HNT2NOT01 vs. genpept132



NCBI-BLASTP 2.0.10 [Aug-26-1999] 

Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,
Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
"Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs", Nucleic Acids Res. 25:3389-3402.

Query = PF-0232-1DIV_HNT2NOT01 
(491 letters) 

Database: AJ-PgtlM 372 , se3 , 10fi totol lettera 



Sequences producing significant alignments:            Score (bits)   E Value

g21618457 tubby like protein 3 [Homo sapiens]                          0.0
g3372493 tubby like protein 3 [Homo sapiens]               825         0.0
g4239948 tubby [Mus musculus]                              628         e-179
g3372491 tubby like protein 3 [Mus musculus]               628         e-179
g3551050 TUBBY protein [Rattus norvegicus]                 454         e-126
g1305497 tub homolog [Homo sapiens]                        453         e-126
g2072160 tub homolog [Homo sapiens]                        451         e-125
QH253111 tubby protein [Mus musculus]                      450         e-125
g1305499 tubby [Mus musculus]
g11071535 tubby (mouse) homolog [Homo sapiens]

> g21618457 tubby like protein 3 [Homo sapiens]
Length = 442

Score = 986 bits (2264), Expect = 0.0

Identities = 435/439 (99%), Positives = 437/439 (99%)

Query- 1 MEASRCRLSPSODSVFHEEKKraWQAKl^ 60 

MEASRCRLS PSGDSVFHESMMKMRQAKLDYQFXLLE^ 
Sbjct: 1 MEASRCRI^PSGDSVraEEMMKMRQAKlJJYQRIXlJ^ 6° 

Query: 61 KPRASDEQTPLVNCHTPHSNVI LHGI DQ PAAVI^PDEVHAPSVSSSWEEDAEOTVETAS 120 

KPRASDEQTPLVNCTTPHSNVILHGIDGPAAVLKFDEVHAPSVSSSVVE^ 
Sbjct: 61 KPRASDEQTPLVNCHTPHSNVILHGIDGPAAVLKFDEW^ 120 

Ouerv- 121 KPGLQERLQKHDI SESVNFDEETDG I SQS ACXiERPNS ASSQNSTUITJTSCS AT AAQPADN 180 

" KPGLQERLQKHDI SESVNFDEETDG I SQSACLERPNSAS SQNSTDTQTSGS ATAAQPADN 

Sbjct: 121 KPGLQERLQKHDI SESVNFDEETDGI SQSACLERPNSAS SQNSTDTGTSGSATAAQPADN 180 

Query 181 LLGD I DDLEDFVYS PAPQGVTVRCR I IRDKRGMDRGIJPTYYMYLEKEENQKI FLLAARK 240 

" LLGDID LEDFVYS PAPQGVTVRCRI IPJJKRGMDFGlJPTyYMYLEKEENQK I FLLAARK 

Sbjct : 181 LLGDI DYLEDFVYS PAPQGVTVRCR I I RDKRGMDRGLFPTYYMYLEKEENQKI FLLAARK 240 

Query: 241 RKKSKTANYLISIDFVDLSREGESYVGKIJISNLMQTKF^ 300 



SI 



RKKSKTANYLI SIDPVD] 
Sbjct: 241 RKKSKTANYLI SID1~ 



3PVD JfcpVOKLRSNLMGTKFTVYDRG: 
jIVD[^^^Bn«SJIJl£NIi«7rKFIVYDRQ: 



i I C PKKOROLVGAlAHTR 
3ICPMKGRGLVOAAHTR 300 



Query- 301 QELAAISYErNVLOFKQF^^^Vl I PGMTLNHKQI PYQPQNNHDSLLSRWQNBTMEKLVE 360 
Query. g^'l^^^ppj^^ 

Sbjct: 301 QELAAI 5YETNVLOFKOPRKMSVI I PGMTLNHKQI FY Q PQNNHDS LLSRWQNRTMEIJLVE 360 

Query- 361 LHNKAPVWNSDriQSYVLNFRaRVT^^ 420 

LHNXAFVWNSOTttSWUlFl^m^ 
Sbjct: 361 IJINKAPVWNSOTQSYVUJFT^VTQA 420 

Query: 421 PLCAVQAFO IGLSSFDKRI 439 

PLCAVQAFGIGLSSFD ++ 
Sbjct: 421 PLCAVQAFO IGLSSFDSKL 439 
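
The percentages in each alignment header are simple ratios over the aligned length. As a minimal sketch (the function names and the regular expression are illustrative assumptions, not part of BLAST), such a header line can be parsed and the identity percentage recomputed as follows. The identity figures in this report are consistent with truncation rather than rounding (for example 244/491 is reported as 49%), so integer division is used.

```python
import re

# Matches fields such as "Identities = 435/439 (99%)" in a BLAST alignment header.
FIELD_RE = re.compile(r"(Identities|Positives|Gaps)\s*=\s*(\d+)/(\d+)\s*\((\d+)%\)")


def parse_match_stats(line):
    """Return {field: (matches, aligned_length, reported_percent)} for one header line."""
    stats = {}
    for name, matches, length, percent in FIELD_RE.findall(line):
        stats[name] = (int(matches), int(length), int(percent))
    return stats


def truncated_percent(matches, length):
    """Percentage truncated to a whole number, as the identity figures appear to be."""
    return 100 * matches // length


header = "Identities = 435/439 (99%), Positives = 437/439 (99%)"
stats = parse_match_stats(header)
matches, length, reported = stats["Identities"]
print(matches, length, reported, truncated_percent(matches, length))
```

Recomputing the ratio in this way is a quick check on header lines whose digits may have been damaged in reproduction.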

> g3372493 tubby like protein 3 [Homo sapiens]
Length = 442

Score = 825 bits (2109), Expect = 0.0

Identities = 410/439 (93%), Positives = 413/439 (93%)

Query* 1 MEASRCRLS pSGDSVFHEEhDWHRQAKIiDYQPJXl^KRQRKKR 60 

' MEASRCM^PSQDSVFHEEKKKMRQAKJLJ^QRLLl^ 
Sbjct: 1 MEASRCRLSPSGTOSVFKE211MKMRQAKLDYQRU.LEKRQR 60 

Query- 61 KPRASDEQTPLVNCHTPHSb^uraiDGPAAVTjCFDF^ 120 

KPFJVSDECTPLVNCHTPHSNVIUK5IIX3PAAVLKH5EVHAPSVSSSWEm 
Sbjct: 61 KPRASDBQTFLVNCHTPHSNVI LHO I DOPAAVLKPDEVHAPSVS SSWEEDAENTVDTAS 120 

Query- 121 KPGLQERLQKHDI SESVNFDEETDG I SQSACLERPNSAS SQNSTDTGTSGSATAAQPADN 180 

KPGLQERLQKHDI SESVNFDEETDG 1 SQSACLER PNSAS SQNSTDTG 
Sbjct: 121 KPGLQERWJKHDISESVOTDEETDGISQSACLERPNSASSQ^^^ PVLLLPPNQLIT 180 

Ouerv- 181 LLGDIDDI^FVYSPAPQGVTTOCRIIKDIOtGMDML^ 240 

LGDI DDLEDFV PAPQGVTVRCRI IRDKRGMDRGLF L KEENQKI FLLAARK 

Sbjct: 181 FLGD IDDLEDFVLWAPQGVTVRCR I IRDKRGMDRGLFSHU^YVLGKEENQKI FLLAARK 240 

Ouerv- 241 RKKSKTANYL I S IDPVDIJJREGESYVGKLRSNLMGTKFTVYDRG ICPMKGRGLVGAAHTR 300 

RKKSKTANYL IS I DFVDLSREGESYVGKIJlSffl J *K3TlCFTVYDRG I CPMKGRGLVOAAHTR 
Sbjct: 241 RKKSKTANYL1S I DPVDLSREGESYVGKIJISNIJ4GTKFTVYDRG ICPMKGRGLVGAAHTR 300 

Query: 301 QELAA I SYETNVLGFKG PRKMSVI I PGOTLNHKQI PYQPQN^^HDSU^RWQ^IRTME^^LVE 360 

QELAAI SYETNVLGFKGPRKMSVI I PGMTUJHKQI PYQPC^JNHDSLLSRWQNRTMENLVE 
Sbjct: 301 QELAA I SYETNVLGFKG PRKKSVT I PGMTLNHKQ I PYQPQNNHDSLI>SRWQNRTMENLVE 360 

Ouerv- 361 LHNKAPVWSDTQ^YVLNFT^VTQASVKNFQIVHKN^ 420 
UB^FVWNSirrQSYVIJIFT^VTQAS^ ^ n 
Sbjct: 361 umKAFVWNSrrrQSYVLNFRGRVTQASVKN^ 420 

Query: 421 PLCAVQAFGIGLSSFDKRI 439 

PLCAVQAF I LSSFD + + 
Sbjct: 421 PLCAVQAFAISLSSFDSKL 439 

>q 4 , 339948 tubby [Mus musculus] 
Length = 460 

SKui^': k;t«s™u>?&£™»"«h <7n>, <«. ■ */«#. <«> 

Ouerv- 1 KEASRCRLSPSGDSVFHEIMMKMRQAKlJTi'QRLL 60 

MEA+RC P GDS F +E +++RQ KLD QR LLEK+QRKKKLEP MVQPtlPEARLRR 
Sbjct: 1 MEAARCLAPGPRGT>SAFDDET1JUJ^QLKLDNQRALLEKX 60 

Query: 61 KPRASDEQTPLVNCHTPHSNVI LHQ I DGPAAVLKPD- EVHAPSVSSSWEEDAEN 114 

KPR S+E TPLV+ P S+VXLHGIDGPAA LKP+ + SV S EE E 

Sbjct: 61 KPRGSEEHTFLVDPQMPRSDVI LHGIDGF AAFLKPEAQDLESKPQVLSVGS PAPEEGTEG 120 

Query- 115 TVD TASKPGLQERLQKHDI SESVNFDEETDG ISQSACLERPNSASS 160 

+ D TA KP LQE LQKH I ^^^^ t ^^ EQQWLS g p |^ RS | T r < ; 180 

Query: 161 QNSTDTTrTSGSATAAQPAIJMjLG^ 220 



+ +++TG SG AQ* D LG++++LEDF YSPAP+GVTV+C++ RDK+GMDRGLFPT 
Sbjct: 181 KAASETGASG - - VTAQQGDAQLGEVENLEDFAYS P APRGVTVKCKVTRDKKGMDRGLFPT 238 

Query- 221 YYMYIjEKEETOKIFLLAARKRKKSKTANYLISIDFVDLSREG 280 

' YYM+ LE+ EEN+KI FLLA RKRKKSKT+NYL+S DP DIJaREGESY+GKLRSNLMGTKFTV 

Sbjct: 239 YYMHLEREENRKI FLIJ^GRKRKKSKTSNYLVSTDFTDLSREGESY I GKLRSNLMGTKFTV 298 

Query- 281 YDRG 1C PMKGRGLVGAAHTRQELAAI SYETNVLGFKGPRKMSVI 1 PGMTLNHKQI PYQPQ 340 

YD G+ P+K +GLV AHTRQELAAI YETNVLGFKG PRKMSVI I PGM +NH++IP++P+ 
Sbjct: 299 YDHGVNFVKAQGLVEKAHTRQELAA 1CYETNVK3FKGPRKMSVI I PGMNMNHER I PFRPR 358 

Query- 341 NNHTJSl^RWQNRTMEKLVElJlNKAFVVWSDrQSYV^^ 400 

N H+SLLS+WQN++MENL+ELHNKAPVWN DTQSYVLNF GRVTQASVKNFQIVH HDPD 
Sbjct: 359 NEHESLLSKWQNKSMENL I EIJiNKAPVWNDDTQSYvLKFHGRvTQASVKNFQ IVHGNDPD 418 

Query: 401 YIVMQFGRVADEVFTLDYNYPLCAVQAFGIGLSSFDKRI 439 

YI VMQPGRVADDVFTLDYNY PLCA+ QAF IGLSSFD ++ 
Sbjct: 419 YrVMQFGRVADmTTLDYNYPLCALQAFAIGLSSFDSKL 457 

> g3372491 tubby like protein 3 [Mus musculus]
Length = 460

Score = 628 bits (1602), Expect = e-179

Identities = 314/459 (68%), Positives = 364/459 (76%), Gaps = 22/459 (4%)

Ouerv- 1 MEASRCRLS PSGTOSVT'HE£MMKMRQAKLnYQRLIXEKRQR03U^ 60 

MEA+RC P GDS F +E +++RQ KLD QR LLEK+QRKKRLEP MVQPNPEARLRR 
Sbjct: 1 MEAARCAPGPRGDSAFDDETLRLRQIjaJJITORMJ^^QR 60 

Query: 61 KPRASDEQTPLVNCHTPHSNVI LHGI DGP AAVLKPD EVHAPSVSSSVVEEDAEN 114 

KPR S+E TPLV+ P S + VI LHGIDGPAA LKP+ + SV S EE E 

Sbjct: 61 KPRGSEEHTPLVDPQMPRSDVILHGIDGPAAFl^PEAQDLESKPQVX^VGSPAPEEGTOT 120 

Query- 115 TVD TASKPGLQERLQKHDI SESVNFDEETDG ISQSACLERPNSASS 160 

+ D TA KP LQE LQKH I SVN+DEE D S SA E +AS 

Sbjct: 121 S ADGE5PEET APKPDLQE I LQKHG I LS SVNYDEEPDKEEDEQGNLSS PSARSEES AAASQ 180 

Query: 161 QNSTDTGTSGS ATAAQPADNL1^3DIDDLEDFVYS PAPQGVTVRCRI IRDKRGKDRGLFPT 220 

+ +++TG SG AQ D LG++++LEDF YSPAP+GVTV+C++ RDK+GMDRGLFPT 
Sbjct: 181 KAASETGASG — VTAQQGDAQLGEVENLEDFAYS PAPRGVTVKCKvTRDKKGMERGLFPT 238 

Query* 221 YYMYLEKEENQK I FXIAARKRKKSKT ANYLI S I DPVDLSRJCESYVGKLRSNLMGTKFTV 280 

YYM+ LE+ EEN+KI FLLA RKRKKSKT+NYL+S DP DLSREGESY+GKLRSNLMGTKFTV 
Sbjct: 239 YYMHLEREENRKI FLLAGRKRKKSKTSNYLVSTDPTDLSREGES Y IGKLRSNLMGTKFTV 298 

Query: 281 YDRGI CPMKGRGLVGAAHTR QELAA I SYETNVLGFKGPRKMSVI 1 PGtfTLNHKQI PYQPQ 340 

YD G+ P+K +GLV AHTRQELAAI YETNVLGFKG PRKMSVI I PGM +NH++IP++P+ 
Sbjct: 299 YDHGVNPVKAQGLVEKAHTRQELAAICYETNVLX3FKGPRKMSVI I POMNMNHERI PFRPR 358 

Query- 341 NNHD5LLSRWQNRTMENLVELHNKAPVWNSDTO 400 

N H + SLLS +WQN+ +MENL+ ELHNKAPVWN DTQSYVLNF GRVTQASVKNFQIVH NDFD 
Sbjct: 359 NEHESIJ^SKMQNKSMENLIELHNKAPVWNDDTQSYVLM^GE 418 

Query: 401 YIVMQFt3RVArDVFTLDYNYPLCAVQAFG IGLSSFDKRI 439 

Y I VMQFGRVADDVFTLDYNYPLCA +QAF IGLSSFD ++ 
Sbjct: 419 Y I VMQFGRVADDVFTLDYNYPLC ALQAFA IGLSSFDSKL 457 

> g3551050 TUBBY protein [Rattus norvegicus]
Length = 505

Score = 454 bits (1156), Expect = e-126
Identities = 244/491 (49%), Positives = 316/491 (63%), Gaps = 66/491 (13%)



14 



SVFliEEMMKMRQAKIOTQRlXLl^QRKKRLEPFMVQPNPEARLRRAK 73 
SV +E +RQ KLD QR LLE++Q+KKR EPMVQN+RR +R S+EQ PLV 
SVUDDEGSNLJIQQKLDRQRAIXEQ^QKKKRQEPLMVQANADGR EEQAPLVE 72 



Query: 
Sbjct: 

Query: 74 CHTPHSNVILH— - " " 64 

+ S + 

Sbjct: 73 S YLS S S G STS YQVQEADS LASVQ PG ATR P P AP AS AKKTKGAAASGQQOGA PRKEKKGKHK 132 
Query: 85 Q IDOPAAVLKP- DEVKAPSVSSSVJEED-AENTVDTASKPG LQERLQKHDI SE 135 



Sbjct: 
Query: 
Sbjct: 
Query: 
Sbjct: 
Query: 
Sbjct: 
Query: 
Sbjct: 
Query: 
Sbjct: 
Query: 

SbjCt: 



G GPA + + E P +V + D A++ +TA+ G L+ +0+ IS 

133 GTSG P ATLAEDKS EAQGPVQI LTVGQSDHAKDAGETAAGGGAQPSGQDLRATMQRKG I SS 192 

136 SVNFDEETD GISQSAa^PNSASSQNSTEfTGTSGSATAAQPADNLLGDIDDLE 189 

S++FDEE D SQ RP+SA+S+ ST S + AA P + ++ DLE 

193 SMS FDEEEDEDENSS S S SQLNSNTRPSS ATSRKSTREAAS APSPAA- PEPPVDI EVQDLE 251 

190 DFVYSPAPQGVTVRCRI I RDKRC3MDRGLFPTYYMYXJ3CEENQK IFIXAARKRKKSKTANY 249 

+F PAPQG+T+ + CRI RDK+GMDRG+ + PTY+ + +L+ + E + +K+FLLA RKRKKSKT+NY 
252 EFAIAPAPQGITIKCRITRDKKGMDRGMYPTYFIin^^ 311 

308 

371 

368 

431 

428 

491 



250 LISIDPVDLSRBGESYVGKIJISNIMmtFTVYDEGIC 

LIS+DP DLSR G+SY+GKLRSNLMGTKFTVYD G+ P K + T RQELAA+ Y 

312 LI SVDPTDLSRGGDSY IGKlJlSNLMaTKPT\nfDNGVNPQKAS SSTLESGTLRQELAAVCY 

309 ETNVLGFKGPRKMSVI I PGMTLNHKQI FTQPQ^MHDSLI-SRVJQNRTMENLVE1J{NKAPVW 
ETNVLGFKGPRKMSVI + PGM + H+++ +P+N H++LL+KWQN+ E+++EL NKPVW 
372 ETNVUJFKGPRKMSVIVPGMNMVHIRVCIRPRNEHE^^^ 

369 NSDTQSYVLliyRGRVTQASVKrn^IVHKNDPDY I VMQFGRVADDVFTLDYNYPLCAVQAF 
N DTQSYVLNF GRVTQ ASVKNFQI +H NDPDYIVMQPGRVA+DVFT+DYNYPLCA+QAF 
432 NDEfTQSYVUIFHGRVTQASVKNFQI I HGNDPDYI VMQFORVAEDVFTMDYNYPLCALQAF 

429 G IGLSSFDKRI 439 

I LSSFD ++ 
492 AIALSSFDSKL 502 



> g1305497 tub homolog [Homo sapiens]
Length = 506

KJi 5J",IJ 1 "U,! X ?r.! t Iv:» 1 "31 8 /4 9 4 ,63.,. M. - 71/.M ,14%, 

Sbjct: 13 SVIXCEGFJJLRCjQKIXlRQRALLEQKQKK^ 72 

Query: 74 CHTPHSNVILH 84 

Sbjct: 73 SYI^SSGSTSYQyQEADSLASVQLGATRPTAPASAKRTKAAATAGG^ 132 

Query: B5 GIK3PAAVLKP-DEVHAPSVSSSVVEED-AENTVDTASKPG - LQERLQKHDI SE 135 

G GPAA+ + E P +V + D A++ +TA+ G L+ +0+ IS 

Sbjct: 133 GTSGPAALAEKKSEACGWQILTVGO^DHAQDAGETAAGQGER 192 

Query: 136 SVNFDE ETDGISQSACLE RPNSASSQNSTDTGTSGSATAA QPADNLLGD1D 186 

S++FDE E + S S+ L RP+SA+S+ S S + A QP D ++ 

Sbjct: 193 SMSFDEDEEDEEENS SSS SQLNSNTRPS S ATSRKSVREAASAPS PTAPEQPVDV EVQ 249 

Query: 187 DLEDFVYS PAPQGVTVRCRI IRDKRGHDRGLFTTYYMYI^EKEENQKI FLLAARKRKKSKT 246 

DLE+F PAPQG+T+ +CRI RDK+GMDRG++PTY+++L++E+ +K+FLLA RKRKKSKT 
Sbjct: 250 DLEEFA1JIPAPQ3ITIKCRITRDKKGMDRGMYPTYFIJ1LDREDQ5^^ 309 

Query: 247 ANYLI S IDPVDL5PJX3ESYVGK1JSNLMGTKFTVYDRG I CPMKGTIGLVGAAHT-RQELAA 305 

+NYLIS+DP DLSR G+SY+GKLRSNLMGTKFTVYD G+ P K + T RQELAA 

Sbjct: 310 SNYL1 SVDPTDLSRGGDSYI GKLRSNU1QTKFT\/YDNGVNPQKAS SSTLESGTLRQELAA 369 

Query: 306 I SYETNVLGFKGPRKMSVI I PGMTLNHKQI J^^'^^^VYS^^* 365 

+ YETNVLGFKGPRKMSVT+PGH + H+++ +P+N H++LL+RWQN+ E+++EL NK 
Sbjct: 370 VCYETNVLGFKGFRKMSVrVPGMNMVHERVS IRPRNEHETLLARWQNKNTES 1 1 ELQNKT 429 

Query: 366 PVVmSDTQSYVI.NFRGRVTQASVKNFQrV^ 425 

PVWN DTQSYVLNF GRVTQASVKNFQI+H NDPDYIVMQPGRVA+DVFT+DYNYPLCA+ 
Sbjct: 430 P^NDD^S^^ 489 

Query: 426 QAFG IGLSSFDKRI 439 

QAF I LSSFD ++ 
Sbjct: 490 QAFAIALSSFDSKL 503 

> g2072160 tub homolog [Homo sapiens]
Length = 561

Score = 451 bits (1148), Expect = e-125

Identities = 244/493 (49%), Positives = 317/493 (63%), Gaps



Query: 15 VFHEEMMKMRQAKLDYQRIJXEKRQRKKJUJSPFMVQPNPEAR 

„ V +E +R Q KLD QR LLE++Q+KKR EPMVQN+RR + r S+EO PLV " 

Sbjct: 69 VLDDEGRNLRCQKUJRQWuUJMKQKK^^ 128 

Query: 75 HTPHSNVILH _ _, 



Query: 



' Y ^SSGSTSYQVQEAI^LASVQLGATRPTAPA 



137 VNFDE 
++FDE 

Sbjct: 249 



IIWPAAV^P-DEVHAPSVSSSVVEED-AEWrVErrASKro LQERLQKHDISES 136 

GPAA * + E P +V + D A+ + +TA+ O l+ +0+ TS <: 

Sbjct: 189 TSG PAALAEDKSEAQGPVQI LTVGQSDHAQIW3ETAAGGGERPSGQDLMTMQRKG I SSS 248 

ETDG ISQSACLE RPNSASSQUSTDTGTSQS ATAA - -QPADNLLGDI DD 187 
E + S S+ L RP+SA+S+ S S + A QP D ++ D 

EEEO^SSSSQLNSNTRPSSATSIOSTOEAASAPSPTAPEQPVDV- --EVQD 305 
Query: 198 LEDFVYS PAPQGVTVRCR 1 1 RDKRCMDRGLFPTYYHYLEKETO 247 

LE+F PAPQ3+T+ +CRI RDK+GMDRG++PTY+++L++E+ +K+FLLA RKR(fTr<ntT* 
Sbjct: 306 ™AL*PA W ITIK CT ITRDKKGHDRG^ 365 

Query: 248 NYL I SI DPVDLSRlKJESYVGXLRSNLMGTra'TVYDRG I C PMKGRGLVGAAHT-RQELAAI 306 

- . « c ™ J S+DP DLSR G^+GKLRSNLMCTCTFTVYD 0+ P K + T RQELAA+ 

Sb 3 ct: 366 NYLISVDPTDI^OC^IGKIJlSNUtGrreFTVY^^ 425 

Query: 307 SYETWLGFKGPRKMSVI I PGMTLKHKQI PYQPQNt^liSRWQKirmENL'/ELHNKAP 366 

. *->c J^WU3raQPRKMSVI + PGM + H+++ +P+N H++LL+RWQN+ E+++EL KK P 
Sb 3 et: 426 CYETNVI^FKGPFJCMSvTVPGMNKVHE3?VS IRPFJiEHETIiARWQNKNTES 1 1 ELQNKTP 485 

Query: 367 WNSDTQSYVLNFRGRVTQASVKN^ 42 6 

^SSSO^,^^^^^ 1 +H NDFDYIVMQFGRVA+DVFT+DYNYPLCA+Q 
Sbjct: 486 VWNDDTQSYVLNFHGRVTQASVKNFQI IHGNDPDY I VMQFT3RVAEET/FTHDYWYPLCALQ 545 

Query: 427 AFGIGLSSFDKR I 439 

AF I LSSFD ++ 
Sbjct: 546 AFAIALSSFDSKL 558 

>flH25JLHl tubby protein [Mus musculus]
Length = 505

Score = 450 bits (1144), Expect = e-125

Identities = 242/491 (49%), Positives = 314/491 (63%), Gaps = 66/491 (13%)

° uer y : 14 ^VFHEIJlMKMRQAKI^YQRIJXEKRQRHau^ 73 
_.. fX, +RQ KLD QR LLE++Q+KKR EP MVQ N + R R + R S+EQ PLV 

Sb]ct: 13 SVLDDEGSNIJIQQKIJ3RQRAIJLEQKQKKKR 72 

Query: 74 CHTPHSNVILH a . 

+ S ♦ 84 

Sbjct: 73 SYLSS SGSTSYQVQEADS I ASVQLGATRPPAPAS AKKSKGAAASGGQGGAPRKEKKGKHK 132 

Query: 85 GIDGPAAVLKP-DEVHAP SVSSSWEEDA- EJnVDTASKPG LQERLQKHDI SE 135 

G GPA + + E P +V S ++DA E ++P L + +Q+ IS 

Sbget: 133 GTSGPATLAEDKSEAQGPVQILTVG<}SDHDKI^ETA 192 

Query: 136 SVNFDEETD GISQSACI^PNSASSCWSTDTOTSGSATAAQPAEHLLODIDDLE 189 

S++FDE+ D SQ RP+SA+S+ S S + AA P + ++ DLE 

Sbjct: 193 SMSFDEDEDEDENSSS 5SQLNSWTRPSSATSRKS IREAASAPS PAA - PEPPVDI EVQDLE 251 

Query: 190 DFVYS PA PQGVTVRCR I IRDKRGMDRGLFPTYYMYLEKEENQK I FLLAARKRKKSKTANY 249 

+F PAPQG+T+ +CRI RDK+GMDRG++ PTY+ + + L+ +E+ +K+FLLA RKRKKSKT+NY 
Sb;jct: 252 EFAIJlPAPQGITIKCRriTOKKGMDRGOT^ 311 

Query: 250 L I S I DFVDLSREGESYVGKIJISNUfGTKFTVYDRG I CPMKGRGLVGAAHT-RQELAA I SY 308 

LIS+DP DLSR G+SY+GKLRSNLMGTKFTVYD G+ P K + T RQELAA+ Y 

Sb 3 ct: 312 LISVDFTDLSRGGDSYIGKIJlSNIJtGT^^ 371 

Query: 309 ETOVUSFKOPRKMSVIIFGOTli^QI^ 36B 

ETNVLGFKGPRKMSVI+PGM + H+++ +P+N H++LL+RWQN+ E+++EL NK PVW 
Sbjct: 372 ETNVLGFKGPRKMSVIVPCMMVHERVCI^ 431 



Query: 307 SYETNVLGFKGPRKKSVI I PGMTLMHKQI PYQP^NHDSU-SRWQNRTMEHLVELHTIKAP 366 
e . . ^ ,„ YETOVLGPXGPRKMSVI+PGM + H+++ +P+N H++LL+RWQN+ E+++EL MK P 
Sbjct: 285 CYETKVLGFKGPRKMSVI VPGMNMVHERVS I RPRNEHETLLARWQNKNTES 1 1 ELQNKTP 344 

Query: 367 VWNSDTQSYVLNFRGRVTQASVKNFQIVH^ 426 
n . . V™ DTQSYVLNF GRVTQASVKNFQI+H NDPDYI VMQFGRVA +DVFT+ DYNYPLCA+Q 

Sb 3 ct: 345 VVMDDTQSYVIJ^GRVTQASVKNFQI IHGNDPDYIVMQPORVAEDVFTMDYKYPLCALQ 404 

Query: 427 AFG I GLS SFDKRI 439 

AF I LSSFD ++ 
Sbjct: 405 AFAIALSSFDSKL 417 



»PDYIVMQFGRVADDVFTLDYNYPLCAVQAF 428 
*H NDFDY I VHQPGRVA +DVFT+ DYNYPLCA+QAF 
I HGNDPDY I VKQFaRVAEDVFTODYNYPLCALQAF 491 



Query: 369 NSDTQSYVLNFW 
N DTQSYVLNF 

Sbjct: 432 

Query: 429 GIQLSSFDKRI 439 

I LSSFD ++ 
Sbjct: 492 AIALSSFDSKL 502

> g 130 549? tubby [Mus musculus]
Length = 505

Score = 450 bits (1144), Expect = e-125

Identities = 242/491 (49%), Positives = 314/491 (63%), Gaps = 66/491 (13%)

Query: 14 SVFHEEhWXMRQAlOJWQRIJiEKRQR^ 73 

SV +E +RQ KLD QR LLE++Q+KKR EPMVQN+RR +R S+EQ PLV 
Sbjct: 13 SVU3DEOSNLRQQKI^RQRAU^KQKKra^ 72 

Query: 74 CHTPHSNVILH Q , 

Sbjct: 73 SYLSSSQSTSYQVQEADS I ASVQI^ATRPPAPASAKKSKGAAASGGQGGAPRKEKKGIQiK 132 

Query: 85 GIDGPAAVLKP-DEVHAP SVSS SWEEDA- ENTVDTASKPG LQERLQKHDI SE 135 

O GPA + + E P +V S ++DA E ++P L+ +Q+ IS 

Sbjct: 133 GTSGPATLAEDI^EAQGFVQILTVGQSDHD^ 192 

Query: 136 SVNFDEETD G I SQSAC1JSRPNS ASSQNSTUTGTSGS ATAAQPADNUjGEIDDLE 189 

S++FDE+ D SQ RP+SA+S+ S S + AAP +++DLE 

Sb 3 ct: 193 SMSFDEDEDEDENSSSS SQLNSNTRPSS ATSRKS IREAASAPS PAA- PEPPVDI EVQDLE 251 

Query: 190 DFVYSPAPQGVTVRCTI IRDKKGMDRGIJPTTYMYLEK^ 249 

+F PAPQG+T+ +CR I RDK+OIDRG++PTY+++L++E+ +K+FXXA RKRKKSKT+NY 
Sbjct: 252 EFAUIPAPQGITIKOUTRDKKGMDRGMYPT^^ 311 

Query: 250 LIS IDPVDI^REGESYVGKLRSNLMQTKFIVYDRG I CPMKGRGLVGAAHT-RQELAAI SY 308 

LIS+DP DLSR G+SY+GKLRSNLMGTKFTVYD G+ P K + T RQELAA+ Y 

Sb 3 ct: 312 LISVDPTDLSRGCSDSYIOKIJISNLMGTKFTVYDNGV^ 371 

Query: 309 ETNVLGFKGPRKMSVI I PGMTLKHKQ I PYQPQNNHDSIiSRWQNRTMET^VEIJOIKAPVW 368 

ETNVLGFKGPRKMSVI+PGM + H+++ +P+N H++LL+RWQN+ E+++EL NK PVW 
Sbjct: 372 ETNVl/SFKGPRKMSVrvPGMNMVHERVC I RPRNEHETLLARWQNKHrES 1 1 ELQNKTPVW 431 

Query: 369 NSDTQSYVIJJFRQRVI^3ASVKNFQIVHKKDPDYIW 42 8 

N DTQSYVLNF GRVTQASVKNFQI+H NDPDY IVMQFGRVA + DVFT+ DYNYPLCA+QAF 
Sbuct: 432 NDDTQSYV1JJFHGRVTQASVKNFQI IHGNDPDY I WQFGRVAEDVFTMDYNYPLCALQAF 491 

Query: 429 GIQLSSFDKRI 439 

I LSSFD ++ 
Sbjct: 492 AIALSSFDSKL 502

>gJL102153Ji tubby (mouse) homolog [Homo sapiens]
Length = 420

Score = 421 bits (1071), Expect = e-116

Identities = 213/373 (57%), Positives = 278/373 (74%), Gaps = 21/373 (5%)

Query: 85 GID3PAAVUCP-DEVh^PSVSSSWEED-AENTVDTASKPG LQERLQKHDI SE 135 

° GPAA+ + E P +V + D A++ +TA+ G L+ +Q+ IS 

Sb]Ct: 48 GTSG PAALAEDKSEAQGPVQI LTVGQSDHAQDAGETAAGGGERPSGQDLRATMQRKG I SS 107 

Query: 136 SVNFDE ETDQISQSACLE- - - RPNS ASSQNSTDTGTSGSATAA-QPADNUjGDI DD 187 

S++FDE E + S S+ L RP+SA+S+ S + ++ S TA QP D ++ D 
Sbjct: 108 SMSFDEDEEDEEEN5 S SSSQLNSNIUPSS ATSRKSVEAASAPSPTAPEQPVDV EVQD 164 

Query: 188 LEDFVYS PAPQ3VTVRCR I IRDKRGMDRGLFPTYYMYLEKEENQK IFL.LAARKRKKSKTA 247 

LE+F PAPQG+T+ +CR I RDK+GMDRG++PTY+++L++E+ +K+FLLA RKRKKSKT+ 
Sb 3 ct: 165 LEEFALRP APQG ITI KCRITRDKKGMDRGMYPTYF1,HIJ)REDGKKVF1IJ^^ 224 

Query: 248 NYLISIDPVDI^SREGF^YVGKLP^NLMGTmVYDRQICPMKGR^ 3 Q6 
^ ^ NYLIS+DP DLSR G+SY+GKIJISNLMGTKFTVYD G+ P K + T RQELAA+ 

Sbjct: 225 NYLlSVDPTDLSRGGDSYIGKIJlSNIJKmtFTVYDNGVN^ 284 



Database: genpept132

Posted date: Nov 14, 2002 2:57 pm 
Number of letters in database: 372,583,108 
Number of sequences in database: 1,206,111 



Lambda     K      H
   0.317    0.133    0.390

Gapped
Lambda     K      H
   0.270   0.0470


Matrix: BLOSUM62 

Gap Penalties: Existence: 11, Extension: 1 

Number of Hits to DB: 362253949 

Number of Sequences: 1206111 

Number of extensions: 15042333 

Number of successful extensions: 41824 

Number of sequences better than 10.0: 118 

Number of HSP's better than 10.0 without gapping: 75 

Number of HSP's successfully gapped in prelim test: 43

Number of HSP's that attempted gapping in prelim test: 41485 

Number of HSP's gapped (non-prelim): 225 

length of query: 491 

length of database: 372,583,108

effective HSP length: 60 

effective length of query: 431 

effective length of database: 300,216,448 

effective search space: 129393289088 

effective search space used: 129393289088

T: 11 

A: 40 

X1: 16 ( 7.3 bits)
X2: 39 (14.8 bits)
X3: 64 (24.9 bits)
S1: 41 (21.7 bits)
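
The summary figures above are internally consistent, and the relationship between them can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes the usual BLAST conventions (the effective HSP length is subtracted once from the query and once per database sequence, and a bit score is computed from the raw score with the gapped Lambda and K). The function names are hypothetical and are not part of any BLAST program.

```python
import math

# Values copied from the search summary above.
QUERY_LENGTH = 491
DB_LETTERS = 372_583_108
DB_SEQUENCES = 1_206_111
EFFECTIVE_HSP_LENGTH = 60
GAPPED_LAMBDA = 0.270
GAPPED_K = 0.0470


def effective_query_length(query_len, hsp_len):
    """Query length after subtracting the effective HSP length."""
    return query_len - hsp_len


def effective_db_length(db_letters, db_sequences, hsp_len):
    """Database length after subtracting the HSP length once per sequence."""
    return db_letters - db_sequences * hsp_len


def bit_score(raw_score, lam, k):
    """Convert a raw alignment score to bits: (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


eff_query = effective_query_length(QUERY_LENGTH, EFFECTIVE_HSP_LENGTH)
eff_db = effective_db_length(DB_LETTERS, DB_SEQUENCES, EFFECTIVE_HSP_LENGTH)
search_space = eff_query * eff_db

print(eff_query, eff_db, search_space)  # 431 300216448 129393289088
print(round(bit_score(1144, GAPPED_LAMBDA, GAPPED_K)))  # 450
```

Because Lambda and K are printed to only three significant figures, bit scores recomputed this way may differ slightly from the reported values for other hits.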
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Microarrays and Toxicology: The Advent of 
Toxicogenomics 

Emile F. Nuwaysir,1 Michael Bittner,2 Jeffrey Trent,2 J. Carl Barrett,1 and Cynthia A. Afshari1

1Laboratory of Molecular Carcinogenesis, National Institute of Environmental Health Sciences, Research Triangle Park,
North Carolina

2Laboratory of Cancer Genetics, National Human Genome Research Institute, Bethesda, Maryland

The availability of genome-scale DNA sequence information and reagents has radically altered life-science
research. This revolution has led to the development of a new scientific subdiscipline derived from a combination
of the fields of toxicology and genomics. This subdiscipline, termed toxicogenomics, is concerned with the
identification of potential human and environmental toxicants, and their putative mechanisms of action, through
the use of genomics resources. One such resource is DNA microarrays or "chips," which allow the monitoring of
the expression levels of thousands of genes simultaneously. Here we propose a general method by which gene
expression, as measured by cDNA microarrays, can be used as a highly sensitive and informative marker for
toxicity. Our purpose is to acquaint the reader with the development and current state of microarray technology
and to present our view of the usefulness of microarrays to the field of toxicology. Mol. Carcinog. 24:153-159,
1999. © 1999 Wiley-Liss, Inc.

Key words: toxicology; gene expression; animal bioassay

INTRODUCTION 

Technological advancements combined with in- 
tensive DNA sequencing efforts have generated an 
enormous database of sequence information over the 
past decade. To date, more than 3 million sequences, 
totaling over 2.2 billion bases [1], are contained 
within the GenBank database, which includes the 
complete sequences of 19 different organisms [2]. The 
first complete sequence of a free-living organism,
Haemophilus influenzae, was reported in 1995 [3] and
was followed shortly thereafter by the first complete
sequence of a eukaryote, Saccharomyces cerevisiae [4].
The development of dramatically improved sequencing
methodologies promises that complete elucidation
of the Homo sapiens DNA sequence is not far
behind [5].

To exploit more fully the wealth of new sequence
information, it was necessary to develop novel meth- 
ods for the high-throughput or parallel monitoring 
of gene expression. Established methods such as 
northern blotting, RNAse protection assays, S1 nuclease
analysis, plaque hybridization, and slot blots
do not provide sufficient throughput to effectively 
utilize the new genomics resources. Newer methods 
such as differential display [6], high-density filter 
hybridization [7,8], serial analysis of gene expression 
[9], and cDNA- and oligonucleotide-based microarray
"chip" hybridization [10-12] are possible solutions
to this bottleneck. It is our belief that the microarray 
approach, which allows the monitoring of expres- 
sion levels of thousands of genes simultaneously, is 
a tool of unprecedented power for use in toxicology 
studies.

Almost without exception, gene expression is altered
during toxicity, as either a direct or indirect
result of toxicant exposure. The challenge facing
toxicologists is to define, under a given set of experimental
conditions, the characteristic and specific
pattern of gene expression elicited by a given
toxicant. Microarray technology offers an ideal platform
for this type of analysis and could be the foundation
for a fundamentally new approach to
toxicology testing.

MICROARRAY DEVELOPMENT AND APPLICATIONS 
cDNA Microarrays 

In the past several years, numerous systems were
developed for the construction of large-scale DNA
arrays. All of these platforms are based on cDNAs
or oligonucleotides immobilized to a solid support.
In the cDNA approach, cDNA (or genomic)
clones of interest are arrayed in a multi-well format
and amplified by polymerase chain reaction.
The products of this amplification, which are usually
500- to 2000-bp clones from the 3' regions of
the genes of interest, are then spotted onto solid
support by using high-speed robotics. By using
this method, microarrays of up to 10 000 clones
can be generated by spotting onto a glass substrate
[13,14]. Sample detection for microarrays on glass
involves the use of probes labeled with fluorescent
or radioactive nucleotides.

*Correspondence to: Laboratory of Molecular Carcinogenesis,
National Institute of Environmental Health Sciences, 111 Alexander
Drive, Research Triangle Park, NC 27709.
Abbreviations: PAH, polycyclic aromatic hydrocarbon; NIEHS, National
Institute of Environmental Health Sciences.

Fluorescent cDNA probes are generated from control
and test RNA samples in single-round reverse-transcription
reactions in the presence of fluorescently
tagged dUTP (e.g., Cy3-dUTP and Cy5-dUTP), which
produces control and test products labeled with different
fluors. The cDNAs generated from these two
populations, collectively termed the "probe," are then
mixed and hybridized to the array under a glass coverslip
[10,11,15]. The fluorescent signal is detected
by using a custom-designed scanning confocal microscope
equipped with a motorized stage and lasers
for fluor excitation [10,11,15]. The data are analyzed
with custom digital image analysis software that determines
for each DNA feature the ratio of fluor 1 to
strength of this approach lies in the ability to label 
RNAs from control and treated samples with differ- 
ent fluorescent nucleotides, allowing for the simul- 
taneous hybridization and detection of both 
populations on one microarray. This method elimi- 
nates the need to control for hybridization between 
arrays. The research groups of Drs. Patrick Brown and 
Ron Davis at Stanford University spearheaded the 
effort to develop this approach, which has been suc- 
cessfully applied to studies of Arabidopsis thaliana 
RNA [10], yeast genomic DNA [15], tumorigenic ver- 
sus non-tumorigenic human tumor cell lines [11], 
human T-cells [18], yeast RNA [19], and human in- 
flammatory disease-related genes [20]. The most dra- 
matic result of this effort was the first published 
account of gene expression of an entire genome, that 
of the yeast Saccharomyces cerevisiae [21].
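
The per-feature ratio described above (fluor 1 to fluor 2, each corrected for local background) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the image-analysis software cited in [16,17]: background correction is assumed to be a simple subtraction, the function name is hypothetical, and the intensities are invented example values.

```python
def corrected_ratio(fluor1_signal, fluor1_background,
                    fluor2_signal, fluor2_background):
    """Background-corrected fluor 1 / fluor 2 ratio for one array feature.

    Returns None when either corrected intensity is not positive,
    because the ratio is then not meaningful.
    """
    numerator = fluor1_signal - fluor1_background
    denominator = fluor2_signal - fluor2_background
    if numerator <= 0 or denominator <= 0:
        return None
    return numerator / denominator


# Invented intensities for three features (arbitrary scanner units).
features = [
    {"name": "feature_1", "f1": 1200, "bg1": 200, "f2": 700, "bg2": 200},
    {"name": "feature_2", "f1": 450, "bg1": 150, "f2": 1350, "bg2": 150},
    {"name": "feature_3", "f1": 180, "bg1": 200, "f2": 900, "bg2": 200},
]

for f in features:
    ratio = corrected_ratio(f["f1"], f["bg1"], f["f2"], f["bg2"])
    print(f["name"], ratio)
```

A ratio above 1 indicates relatively more fluor 1 signal at that feature, a ratio below 1 relatively more fluor 2 signal, and a feature whose signal does not exceed its background is reported as None rather than given a spurious ratio.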

In an alternative approach, large numbers of cDNA
clones can be spotted onto a membrane support, albeit
at a lower density [7,22]. This method is useful
for expression profiling and large-scale screening and
mapping of genomic or cDNA clones [7,22-24]. In
expression profiling on filter membranes, two different
membranes are used simultaneously for control
and test RNA hybridizations, or a single
membrane is stripped and reprobed. The signal is
detected by using radioactive nucleotides and visualized
by phosphorimager analysis or autoradiography.
Numerous companies now sell such cDNA
membranes and software to analyze the image data
[25-27].

Oligonucleotide Microarrays 

Oligonucleotide microarrays are constructed either 
by spotting prefabricated oligos on a glass support 
[13] or by the more elegant method of direct in situ 
oligo synthesis on the glass surface by photolithography
[28-30]. The strength of this approach lies in
its ability to discriminate DNA molecules based on
single base-pair difference. This allows the application
of this method to the fields of medical diagnostics,
pharmacogenetics, and sequencing by hybridization
as well as gene-expression analysis.

Fabrication of oligonucleotide chips by photolithography
is theoretically simple but technically
complex [29,30]. The light from a high-intensity
mercury lamp is directed through a photolithographic
mask onto the silica surface, resulting in
deprotection of the terminal nucleotides in the illuminated
regions. The entire chip is then reacted with
the desired free nucleotide, resulting in selected chain
elongation. This process requires only 4n cycles
(where n = oligonucleotide length in bases) to synthesize
a vast number of unique oligos, the total number
of which is limited only by the complexity of the
photolithographic mask and the chip size [29,31,33].
Sample preparation involves the generation of
double-stranded cDNA from cellular poly(A)+ RNA
followed by antisense RNA synthesis in an in vitro
transcription reaction with biotinylated or fluor-tagged
nucleotides. The RNA probe is then fragmented
to facilitate hybridization. If the indirect
visualization method is used, the chips are incubated
with fluor-linked streptavidin (e.g., phycoerythrin)
after hybridization [12,33]. The signal is detected with
a custom confocal scanner [34]. This method has
been applied successfully to the mapping of genomic
library clones [35], to de novo sequencing by hybridization
[28,36], and to evolutionary sequence comparison
of the BRCA1 gene [37]. In addition,
mutations in the cystic fibrosis [38] and BRCA1 [39]
gene products and polymorphisms in the human immunodeficiency
virus-1 clade B protease gene [40]
have been detected by this method. Oligonucleotide
chips are also useful for expression monitoring [illegible],
as has been demonstrated by the simultaneous evaluation
of gene-expression patterns in nearly all open
reading frames of the yeast strain S. cerevisiae [12].
More recently, oligonucleotide chips have been used
to help identify single nucleotide polymorphisms in
the human [41] and yeast [42] genomes.
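
The 4n-cycle bound on photolithographic synthesis can be made concrete with a toy model. The sketch below is an illustration only and makes several assumptions: each cycle is represented as one base plus the set of features the mask illuminates, chemistry and synthesis direction are ignored, and the function names and example oligos are hypothetical.

```python
BASES = "ACGT"


def plan_cycles(oligos):
    """Return a list of (base, mask) cycles that builds every oligo.

    For each position there is one cycle per base, so n positions need
    4n cycles no matter how many features are on the chip.
    """
    length = len(oligos[0])
    if any(len(o) != length for o in oligos):
        raise ValueError("all oligos must have the same length")
    cycles = []
    for position in range(length):
        for base in BASES:
            # The mask exposes exactly the features needing this base next.
            mask = {i for i, o in enumerate(oligos) if o[position] == base}
            cycles.append((base, mask))
    return cycles


def synthesize(n_features, cycles):
    """Apply the cycles in order and return the chain grown at each feature."""
    chains = [""] * n_features
    for base, mask in cycles:
        for feature in mask:
            chains[feature] += base
    return chains


targets = ["ACGT", "TTTT", "GATC"]  # hypothetical 4-mers
cycles = plan_cycles(targets)
print(len(cycles))  # 16, i.e. 4n for n = 4
print(synthesize(len(targets), cycles) == targets)  # True
```

The cycle count depends only on oligo length, whereas the number of distinct sequences that could in principle be made grows as 4 to the power n, which is why mask complexity and chip size, rather than the number of chemical steps, are the limiting factors.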

THE USE OF MICROARRAYS IN TOXICOLOGY 
Screening for Mechanism of Action 

The field of toxicology uses numerous in vivo 
model systems, including the rat, mouse, and rabbit,
to assess potential toxicity and these bioassays
are the mainstay of toxicology testing. However, in
the past several decades, a plethora of in vitro techniques
many of which measure toxicant-induced DNA dam- 
age. Examples of these assays include the Ames test, 
the Syrian hamster embryo cell transformation as- 
say, micronucleus assays, measurements of sister 
chromatid exchange and unscheduled DNA synthe- 
sis, and many others. Fundamental to all of these 
methods is the fact that toxicity is often preceded 
by, and results in, alterations in gene expression. In
many cases, these changes in gene expression are a
far more sensitive, characteristic, and measurable
endpoint than the toxicity itself. We therefore pro- 
pose that a method based on measurements of the 
genome-wide gene expression pattern of an organ- 
ism after toxicant exposure is fundamentally infor- 
mative and complements the established methods 
described above. 

We are developing a method by which toxicants
can be identified and their putative mechanisms of
action determined by using toxicant-induced gene expression
profiles. In this method, in one or more defined
model systems, dose and time-course parameters
are established for a series of toxicants within a given
prototypic class (e.g., polycyclic aromatic hydrocarbons
(PAHs)). Cells are then treated with these agents
at a fixed toxicity level (as measured by cell survival),
RNA is harvested, and toxicant-induced gene expression
changes are assessed by hybridization to a cDNA
microarray chip (Figure 1). We have developed a custom
DNA chip, called ToxChip v1.0, specifically for
this purpose and will discuss it in more detail below.
The changes in gene expression induced by the test
agents in the model systems are analyzed, and the
common set of changes unique to that class of toxicants,
termed a toxicant signature, is determined.

This signature is derived by ranking across all experiments
the gene-expression data based on relative
fold induction or suppression of genes in treated
samples versus untreated controls and selecting the 
most consistently different signals across the sample 
set. A different signature may be established for each 
prototypic toxicant class. Once the signatures are de- 
termined, gene-expression profiles induced by un- 
known agents in these same model systems can then 
be compared with the established signatures. A match 
assigns a putative mechanism of action to the test 
compound. Figure 2 illustrates this signature method 
for different types of oxidant stressors, PAHs, and 
peroxisome proliferators. In this example, the unknown
compound in question had a gene-expression
profile similar to that of the oxidant stressors in
the database. We anticipate that this general method 
will also reveal cross talk between different pathways 
induced by a single agent (e.g., reveal that a com- 
pound has both PAH-like and oxidant-like proper- 
ties). In the future, it may be necessary to distinguish 
very subtle differences between compounds within 
a very large sample set (e.g., thousands of highly similar
structural isomers in a combinatorial chemistry
library or peptide library). To generate these highly 
refined signatures, standard statistical clustering tech- 
niques or principal-component analysis can be used. 
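
The signature approach described above can be sketched as two steps: derive a signature from experiments with known toxicants of one class, then score an unknown compound against it. The code below is a simplified illustration, not the authors' analysis pipeline. It replaces ranking with a fixed threshold on log2(treated/control), and the function names, the threshold of 1.0 (a two-fold change), and the gene labels are all assumptions made for the example.

```python
def toxicant_signature(experiments, min_abs_log2=1.0):
    """Genes changed in the same direction in every experiment of a class.

    experiments: list of dicts mapping gene -> log2(treated / control).
    Returns a dict mapping gene -> +1 (induced) or -1 (suppressed).
    """
    shared_genes = set.intersection(*(set(e) for e in experiments))
    signature = {}
    for gene in shared_genes:
        values = [e[gene] for e in experiments]
        if all(v >= min_abs_log2 for v in values):
            signature[gene] = 1
        elif all(v <= -min_abs_log2 for v in values):
            signature[gene] = -1
    return signature


def match_score(profile, signature, min_abs_log2=1.0):
    """Fraction of signature genes the profile changes in the same direction."""
    if not signature:
        return 0.0
    hits = sum(
        1
        for gene, direction in signature.items()
        if profile.get(gene, 0.0) * direction >= min_abs_log2
    )
    return hits / len(signature)


# Invented log2 ratios for two hypothetical oxidant-stressor experiments.
oxidant_runs = [
    {"A1": 2.1, "A2": -1.5, "B1": 0.1},
    {"A1": 1.8, "A2": -2.0, "B1": -0.2},
]
oxidant_signature = toxicant_signature(oxidant_runs)

unknown = {"A1": 1.9, "A2": -1.2, "B1": 0.0}
print(oxidant_signature)
print(match_score(unknown, oxidant_signature))
```

A score near 1 would suggest that the unknown compound resembles the class, and scoring one profile against several signatures is one way to look for the mixed, cross-talk behavior mentioned in the text. For very large or highly similar compound sets, clustering or principal-component analysis would take the place of this simple threshold.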

For the studies outlined in Figure 2, we developed
the custom cDNA microarray chip ToxChip v1.0.

Figure 1. Simplified overview of the method for sample
preparation and hybridization to cDNA microarrays. For illustrative
purposes, samples derived from cell culture are depicted,
although other sample types are amenable to this analysis.

Figure 2. Schematic representation of the method for identification
of a toxicant's mechanism of action. In this method,
gene-expression data derived from exposure of model systems
to known toxicants are analyzed, and a set of changes
characteristic to that type of toxicant (termed the toxicant
signature) is identified. As depicted, oxidant stressors produce
consistent changes in group A genes (indicated by red and
green circles), but not group B or C genes (indicated by gray
circles). The set of gene-expression changes elicited by the
suspected toxicant is then compared with these characteristic
patterns, and a putative mechanism of action is assigned to
the unknown agent.




The 2090 human genes that comprise this subarray 
were selected for their well-documented involve- 
ment in basic cellular processes as well as their re- 
sponses to different types of toxic insult. Included 
on this list are DNA replication and repair genes, 
apoptosis genes, and genes responsive to PAHs and 
dioxin-like compounds, peroxisome proliferators,
estrogenic compounds, and oxidant stress. Some of
the other categories of genes include transcription
factors, oncogenes, tumor suppressor genes, cyclins,
kinases, phosphatases, cell adhesion and motility
genes, and homeobox genes. Also included in this
group are 84 housekeeping genes, whose hybridization
intensity is averaged and used for signal normalization
of the other genes on the chip. To date,
very few toxicants have been shown to have appreciable
effects on the expression of these housekeeping
genes. However, this housekeeping list will be
revised if new data warrant the addition or deletion
of a particular gene. Table 1 contains a general description
of some of the different classes of genes
that comprise ToxChip v1.0.
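
The housekeeping-based normalization described above amounts to dividing every intensity by the mean intensity of the housekeeping set. The following is a minimal sketch under that assumption. The function name and the gene labels are hypothetical, and the three-gene example stands in for the 84 housekeeping genes on the actual chip.

```python
def normalize_to_housekeeping(intensities, housekeeping_genes):
    """Scale all intensities by the mean housekeeping-gene intensity.

    intensities: dict mapping gene -> hybridization intensity.
    housekeeping_genes: iterable of gene names used as the reference set.
    """
    reference = [intensities[g] for g in housekeeping_genes if g in intensities]
    if not reference:
        raise ValueError("no housekeeping genes found on this array")
    mean_reference = sum(reference) / len(reference)
    if mean_reference <= 0:
        raise ValueError("mean housekeeping intensity must be positive")
    return {gene: value / mean_reference for gene, value in intensities.items()}


# Invented intensities; hk1 and hk2 play the role of housekeeping genes.
array = {"hk1": 100.0, "hk2": 300.0, "geneX": 400.0}
normalized = normalize_to_housekeeping(array, ["hk1", "hk2"])
print(normalized["geneX"])  # 2.0
```

Averaging over many reference genes makes the scaling factor less sensitive to any single housekeeping gene that a toxicant does happen to affect, which is consistent with the authors' plan to revise the list as new data arrive.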

When a toxicant signature is determined, the 
genes within this signature are flagged within the 
database. When uncharacterized toxicants are then 
screened, the data can be quickly reformatted so that 
blocks of genes representing the different signatures
are displayed [11]. This facilitates rapid, visual interpretation
of data. We are also developing ToxChip
v2.0 and chips for other model systems,
including rat, mouse, Xenopus, and yeast, for use in
toxicology studies.

Animal Models in Toxicology Testing 

The toxicology community relies heavily on the 
use of animals as model systems for toxicology testing.
Unfortunately, these assays are inherently expensive,
require large numbers of animals and take a
long time to complete and analyze. Therefore, the 
National Institute of Environmental Health Sciences 
(NIEHS), the National Toxicology Program, and the 
toxicology community at large are committed to re- 
ducing the number of animals used, by developing 
more efficient and alternative testing methodologies. 
Although substantial progress has been made in the 
development of alternative methods, bioassays are 
still used for testing endpoints such as neurotoxic- 
ity, immunotoxicity, reproductive and developmen- 
tal toxicology, and genetic toxicology. The rodent 
cancer bioassay is a particularly expensive and time- 
consuming assay, as it requires almost 4 yr, 1200 
animals, and millions of dollars to execute and ana- 
lyze [43]. In vitro experiments of the type outlined 
in Figure 2 might provide evidence that an unknown
agent is (or is not) responsible for eliciting a given
biological response. This information would help to
select a bioassay more specifically suited to the agent
in question or perhaps suggest that a bioassay is not
necessary, which would dramatically reduce cost,
animal use, and time.

Table 1. ToxChip v1.0: A Human cDNA Microarray
Chip Designed to Detect Responses to Toxic Insult

Gene category                            No. of genes on chip

Apoptosis                                 72
DNA replication and repair                99
Oxidative stress/redox homeostasis        90
Peroxisome proliferator responsive        22
Dioxin/PAH responsive                     12
Estrogen responsive                       63
Housekeeping                              84
Oncogenes and tumor suppressor genes      76
Cell-cycle control                        51
Transcription factors                    131
Kinases                                  276
Phosphatases                              88
Heat-shock proteins                       23
Receptors                                349
Cytochrome P450s                          30

*This list is intended as a general guide. The gene categories are not
unique, and some genes are listed in multiple categories.

The addition of microarray techniques to stan- 
dard bioassays may dramatically enhance the sen- 
sitivity and interpretability of the bioassay and 
possibly reduce its cost. Gene-expression signatures 
could be determined for various types of tissue-specific
toxicants, and new compounds could be
screened for these characteristic signatures, providing
a rapid and sensitive in vivo test. Also, because
gene expression is often exquisitely sensitive to low 
doses of a toxicant, the combination of gene-expres- 
sion screening and the bioassay might allow the use 
of lower toxicant doses, which are more relevant to 
human exposure levels, and the use of fewer animals.
In addition, gene-expression changes are normally
measured in hours or days, not in the months
to years required for tumor development. Further- 
more, microarrays might be particularly useful for 
investigating the relationship between acute and 
chronic toxicity and identifying secondary effects 
of a given toxicant by studying the relationship
between the duration of exposure to a toxicant and 
the gene-expression profile produced. Thus, a bio- 
assay that incorporates gene-expression signatures 
with traditional endpoints might be substantially 
shorter, use more realistic dose regimens, and cost 
substantially less than the current assays do. 

These considerations are also relevant for branches 
of toxicology not related to human health and not 
using rodents as model systems, such as aquatic toxi- 
cology and plant pathology. Bioassays based on the 
fathead minnow, Daphnia, and Arabidopsis could
also be improved by the addition of microarray analysis.
The combination of microarrays with traditional
bioassays might also be useful for investigating some 
of the more intractable problems in toxicology re- 
search, such as the effects of complex mixtures and 
the difficulties in cross-species extrapolation. 

Exposure Assessment, Environmental Monitoring, 
and Drug Safety 

The currently used methods for assessment of exposure
to chemical toxicants are based on measurement
of tissue toxin levels or on surrogate markers
of toxicity, termed biomarkers (e.g., peripheral blood
levels of hepatic enzymes or DNA adducts). Because
gene expression is a sensitive endpoint, gene expression
as measured with microarray technology may
be useful as a new biomarker to more precisely identify
hazards and to assess exposure. Similarly,
microarrays could be used in an environmental-monitoring
capacity to measure the effect of potential
contaminants on the gene-expression profiles
of resident organisms. In an analogous fashion,
microarrays could be used to measure gene-expression
endpoints in subjects in clinical trials. The combination
of these gene-expression data and more
established toxic endpoints in these trials could be
used to define highly precise surrogates of safety.

Gene-expression profiles in samples from exposed 
individuals could be compared to the profiles of the 
same individuals before exposure. From this infor- 
mation, the nature of the toxic exposure can be de- 
termined or a relative clinical safety factor estimated. 
In the future it may also be possible to estimate not 
only the nature but the dose of the toxicant for a 
given exposure, based on relative gene-expression 
levels. This general approach may be particularly 
appropriate for occupational-health applications, in 
which unexposed and exposed samples from the 
same individuals may be obtainable. For example, 
a pilot study of gene expression in peripheral-blood
lymphocytes of Polish coke-oven workers exposed
to PAHs (and many other compounds) is under consideration
at the NIEHS. An important consideration
for these types of studies is that gene expression can 
be affected by numerous factors, including diet, 
health, and personal habits. To reduce the effects 
of these confounding factors, it may be necessary 
to compare pools of control samples with pools of 
treated samples. In the future it may be possible to 
compare exposed sample sets to a national database 
of human-expression data, thus eliminating the 
need to provide an unexposed sample from the same 
individual. Efforts to develop such a national gene- 
expression database are currently under way [44,45].
However, this national database approach will require
a better understanding of genome-wide gene
expression across the highly diverse human population
and of the effects of environmental factors
on this expression.

cell's development or response and should help in the elucidation of specific and
sensitive biomarkers representing, for example, different types of cancer or previous
exposure to certain classes of chemicals that are enzyme inducers.

In drug metabolism, many of the xenobiotic-metabolizing enzymes (including
the well-characterized isoforms of cytochrome P450) are inducible by drugs and
chemicals in man (Pelkonen et al. 1998), predominantly involving transcriptional
[illegible] genes, but additional cellular
proteins which may be crucial to the phenomenon of induction. Accordingly, the
development of methodology to identify and assess the full complement of genes
that are either up- or down-regulated by inducers is crucial in the development of
knowledge to understand the precise molecular mechanisms of enzyme induction
[illegible] drug action. Similarly, in the field of chemical-induced
toxicity it is now becoming increasingly obvious that most adverse reactions to
drugs and chemicals are the result of multiple gene regulation, some of which are
[illegible] toxicological phenomenon per
se. This observation has led to an upsurge in interest in gene-profiling technologies
which differentiate between the control and toxin-treated gene pools in target tissues
and is, therefore, of value in rationalizing the molecular mechanisms of xenobiotic-induced
toxicity. Knowledge of toxin-dependent gene regulation in target tissues is
not solely an academic pursuit, as much interest has been generated in the
pharmaceutical industry to harness this technology in the early identification of toxic
drug candidates, thereby shortening the developmental process and contributing
substantially to the safety assessment of new drugs. For example, if the gene profile
in response to, say, a testicular toxin that has been well-characterized in vivo could be
determined in the testis, then this profile would be representative of all new drug
candidates which act via this specific molecular mechanism of toxicity, thereby
providing a useful and coherent approach to the early detection of such toxicants.
Whereas it would be informative to know the identity and functionality of all genes
up/down regulated by such toxicants, this would appear a longer term goal, as the
majority of human genes have not yet been sequenced, far less their functionality
determined. However, the current use of gene profiling yields a pattern of gene
changes for a xenobiotic of unknown toxicity which may be matched to that of
well-characterized toxins, thus alerting the toxicologist to possible in [illegible] similarities
between the unknown and the standard, thereby providing a platform for more
extensive toxicological examination. Such approaches are beginning to gain
momentum, in that several biotechnology companies are commercially producing
gene chips or 'gene arrays' that may be interrogated for toxicity assessment of
xenobiotics. These chips consist of hundreds/thousands of genes, some of which are
degenerate, in the sense that not all of the genes are mechanistically related to any
one toxicological phenomenon. Whereas these chips are useful in broad-spectrum
screening, they are maturing at a substantial rate, in that gene arrays are now
becoming more specific, e.g. chips for the identification of changes in growth factor
families that contribute to the aetiology and development of chemically-induced
neoplasias.

Differential gene expression

Although documenting and explaining these genetic changes presents a
formidable obstacle to understanding the different mechanisms of development and
disease progression, the technology is now available to begin attempting this difficult
challenge. Indeed, several 'differential expression analysis' methods have been
developed which facilitate the identification of gene products that demonstrate
altered expression in cells of one population compared to another. These methods
have been used to identify differential gene expression in many [illegible]
invading pathogenic microbes (Zhao et al. 1998), in cells responding to [illegible]
in chemically treated cells (Syed et al. 1997, [illegible] et al.
1999), neoplastic cells (Liang et al. 1992, Chang and Terzaghi-Howe 1998),
activated cells (Gurskaya et al. 1996, Wan et al. 1996), differentiated cells ([illegible]
et al. 1998). Although differential expression analysis [illegible]
absolutely no prior knowledge of the specific genes
which are up- or down-regulated is required.

The field of differential expression analysis is a large and complex one with 
several methodological approaches, including: 

(1) Differential screening, 

(2) Subtractive hybridization (SH) (includes methods such as chemical
cross-linking subtraction, CCLS; suppression-PCR subtractive hybridization,
SSH; and representational difference analysis, RDA),

(3) Differential display (DD),

(4) Restriction endonuclease-facilitated analysis of gene expression (including
serial analysis of gene expression, SAGE, and gene expression fingerprinting, GEF),

(5) Gene expression arrays, and

(6) Expressed sequence tag (EST) analysis. 

The above approaches have been used successfully to isolate differentially
[...] subtle (and sometimes not so subtle) characteristics, which incur various
advantages [...] the main differential expression [...]
highlight some of the broader considerations and implications of this very powerful
and increasingly popular technique. Specifically, we will concentrate on the
so-called open systems, namely those which do not require any knowledge of
sequence and, therefore, are useful for isolating unknown genes. [...]
closed systems [...] previously identified gene sequences), EST analysis and the
[...], will also be considered briefly for completeness. Whilst
emphasis will often be placed on suppression PCR subtractive hybridization (SSH,
the approach employed in this laboratory), it is the aim of the authors to highlight
[...] differential gene expression analysis [...] use.
Differential cDNA library screening (DS) 

[...] development of multiple technological advances which have recently
brought [...] of differential gene expression and characterization of
differentially expressed genes has existed for many years. One of the original
[...] Davis (1979). These authors developed a method, termed 'differential plaque filter
[...]

Figure 1. The hydroxyapatite method of subtractive hybridization. cDNA derived from [...]

[...] containing a restriction site) ligated to [...] populations are then
amplified by PCR, but the driver cDNA population is subsequently digested with
the adaptor-containing restriction endonuclease. This serves to cleave the [...]
and reduce the amplification potential of the control population. The digested
control population is then biotinylated and an excess mixed with tester cDNA.
Following denaturation and hybridization, the mix is applied to a biocytin column
(streptavidin may also be used) to remove the control population [...]
heteroduplexes formed by annealing of common sequences from the tester
population. The procedure is repeated several times following the addition of fresh
control cDNA. In order to further enrich those species differentially expressed in
the tester cDNA, the subtracted tester population is amplified by PCR following
every second subtraction cycle. After six cycles of subtraction (three reamplification
steps) the reaction mix is ligated into a vector for further analysis.

In a slightly different approach, Hara et al. (1991) utilized a method whereby
oligo(dT30) primers attached to a latex substrate are used to first capture mRNA
extracted from the control population. Following 1st strand cDNA synthesis, the
RNA strand of the heteroduplexes is removed by heat denaturation and
centrifugation (the cDNA-oligotex-dT30 forms a pellet and the supernatant is removed).
A quantity of tester mRNA is then repeatedly hybridized to the immobilized control
(driver) cDNA (which is present in 20-fold excess). After several rounds of
hybridization the only mRNA molecules left in the tester mRNA population are
those which are not found in the driver cDNA-oligotex-dT30 population. These
tester-specific mRNA species are then converted to cDNA and, following the
addition of adaptor sequences, amplified by PCR. The PCR products are then 
ligated into a vector for further analysis using restriction sites incorporated into the 
PCR primers. A schematic illustration of this subtraction process is shown in figure 

However, all these methods utilising physical separation have been described as 
inefficient due to the requirement for large starting amounts of mRNA, significant 
loss of material during the separation process and a need for several rounds of 
hybridization. Hence, new methods of differential expression analysis have recently 
been designed to eliminate these problems.

Chemical Cross-Linking Subtraction (CCLS)

In this technique, originally described by Hampson et al. (1992), driver mRNA 
is mixed with tester cDNA (1st strand only) in a ratio of > 20: 1. The common 
sequences form cDNA:mRNA hybrids, leaving the tester specific species as single 
stranded cDNA. Instead of physically separating these hybrids, they are inactivated 
chemically using 2,5-diaziridinyl-1,4-benzoquinone (DZQ). Labelled probes are
then synthesized from the remaining single stranded cDNA species (unreacted
mRNA species remaining from the driver are not converted into probe material due
to specificity of Sequenase T7 DNA polymerase used to make the probe) and used
to screen a cDNA library made from the tester cell population. A schematic diagram
of the system is shown in figure 3.

It has been shown that the differentially expressed sequences can be enriched at 
least 300-fold with one round of subtraction (Hampson et al. 1992), and that the 
technique should allow isolation of cDNAs derived from transcripts that are present 
at less than 50 copies per cell. This equates to genes at the low end of intermediate 
abundance (see table 1). The main advantages of the CCLS approach are that it is 
rapid, technically simple and also produces fewer false positives than other 
differential expression analysis methods. However, like the physical separation 
protocols, a major drawback with CCLS is the large amount of starting material
required (at least 10 µg RNA). Consequently, the technique has recently been
refined so that a renewable source of RNA can be generated. The degenerate random
oligonucleotide primed (DROP) adaptation (Hampson et al. 1996, Hampson and
Hampson 1997) uses random hexanucleotide sequences to prime solid
phase-synthesized cDNA. Since each primer includes a T7 polymerase promoter sequence
[...]

Suppression PCR Subtractive Hybridization (SSH)

The most recent adaptation of the SH approach to differential expression
analysis was first described by Diatchenko et al. (1996) and Gurskaya et al. (1996).
They reported that a 1000-3000 fold enrichment of rare cDNAs (equivalent to
isolating mRNAs present at only a few copies per cell) can be obtained without the
need for multiple hybridizations/subtractions. Instead of physical or chemical
removal of the common sequences, a PCR-based suppression system is used (see
[...]).

In SSH, excess driver cDNA is added to two portions of the tester cDNA which
[...] different adaptors. A first round of hybridization serves to
enrich differentially expressed genes and equalize rare and abundant messages.
Equalization occurs since reannealing is more rapid for abundant molecules than for
rarer molecules due to the second order kinetics of hybridization (James and Higgins
[...]). The two primary hybridization mixes are then mixed together in the presence
[...] to hybridize further. This step permits the annealing of
single stranded complementary sequences which did not hybridize in the primary
hybridization, and in doing so generates templates for PCR amplification. Although
there are several possible combinations of the single stranded molecules present in
the secondary hybridization mix, only one particular combination (differentially
expressed in the tester cDNA, composed of complementary strands having different
adaptors) can amplify exponentially.
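
The equalization argument above rests on second order reannealing kinetics, and a
small numerical sketch may make it concrete. The model below is a deliberately
idealized assumption, not the authors' calculation: each species reanneals only with
itself according to dC/dt = -k C^2, so the single stranded concentration remaining
after time t is C(t) = C0 / (1 + k C0 t). The rate constant, time and starting
concentrations are arbitrary illustrative values, and the excess driver is ignored.

```python
def single_stranded_remaining(c0, k, t):
    """Single stranded concentration left after second order reannealing.

    Solves dC/dt = -k * C**2 with C(0) = c0, giving C(t) = c0 / (1 + k*c0*t).
    """
    return c0 / (1.0 + k * c0 * t)


# Arbitrary illustrative values (assumptions, not measured data).
abundant_start, rare_start = 100.0, 1.0   # a 100:1 abundance difference
k, t = 1.0, 1.0

abundant_left = single_stranded_remaining(abundant_start, k, t)
rare_left = single_stranded_remaining(rare_start, k, t)

ratio_before = abundant_start / rare_start
ratio_after = abundant_left / rare_left

print(f"abundant:rare before hybridization = {ratio_before:.1f}:1")
print(f"abundant:rare after hybridization  = {ratio_after:.2f}:1")
```

In this toy model the abundant species loses almost all of its single stranded
molecules while the rare species loses only half, so the two are brought much closer
together. Real hybridizations will deviate from this, which is consistent with the
imperfect equalization noted later in the text.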

[...] obtained the final differential display, two options are available if cloning
of cDNAs is desired. One is to transform the whole of the final PCR reaction into
[...]. Transformed colonies can then be isolated and their inserts
characterized by sequencing, restriction analysis or PCR. Alternatively, the final
[...] on a gel and the individual [...] excised, reamplified
and [...]. The first approach is technically simpler and less time consuming.
However, ligation/transformation reactions are known to be biased towards the
cloning of smaller molecules, and so the final population of clones will probably not
contain a representative selection of the larger products. In addition, although
equalization theoretically occurs, observations in this laboratory suggest that this is
[...] accomplished. Consequently, some gene species are present
in a higher number than others and this will be represented in the final population
[...]. In order to obtain a substantial proportion of those gene species that
demonstrate differential expression in the tester population, the number of
clones that will have to be screened after this step may be substantial. The second
approach is initially more time consuming and technically demanding. However, it
would appear to offer better prospects for cloning larger and low abundance gene
products. In addition, one can incorporate a screening step that differentiates
different products of different sequences but of the same size (HA-staining, see
later). In this way, a good idea of the final number of clones to be isolated and
identified can be achieved.

An alternative (or even complementary) approach is to use the final differential
display reaction to screen a cDNA library to isolate full length clones for further
characterization, or a DNA array (see later) to quickly identify known genes. SSH
has been used in this laboratory to begin characterization of the short-term gene
[...] of enzyme-inducers, such as phenobarbital (Rockett et al. 1997)
and Wy-14,643 (Rockett et al. unpublished observations). The isolation of
differentially expressed genes in this manner enables the construction of a fingerprint
of expressed genes which are unique to each compound and time/dose point. Such
information could be useful in short-term characterization of the toxic potential of
[...] by comparing the gene-expression profiles they elicit with those
produced by known inducers. Figure 6 shows a flow diagram of the method used
[...] and figure 7 shows expression
profiles obtained from a typical SSH experiment. Subsequent sub-cloning of the
individual bands, sequencing and gene data base interrogation reveals many genes
which are either up- or down-regulated by phenobarbital in the rat (tables 2 and 3).

[...] required of which specific genes are up/down-regulated subsequent to xenobiotic
[...] large number of cellular [...] differentiation (Levy et [...]
[...]rent in the phenomena [...] it is intriguing, and [...] differentially regulated
[...] of this approach is [...] database sequences, but [...] genes of completely
[...] overall assessment of the [...] the lack of complete [...] gene profiling studies
[...] xenobiotic challenge, [...] for further detailed [...]

Table 2. Genes up-regulated in rat liver following 3-day exposure to phenobarbital.

Band number                  Highest sequence
(approximate size in bp)     similarity          FASTA-EMBL gene identification

5 (1300)                     93.5%               CYP2B1
[?] (1000)                   95.1%               Preproalbumin
                                                 Serum albumin mRNA
8 (950)                      98.3%               NCI-CGAP-Pr1 H. sapiens (EST)
10 (850)                     95.7%               CYP2B1
11 (800)       Clone 1       94.9%               CYP2B1
               Clone 2       75.3%               CYP2B2
12 (750)                     93.8%               TRPM-2 mRNA
                                                 Sulfated glycoprotein
15 (600)                     92.9%               Preproalbumin
                                                 Serum albumin mRNA
16 (55)        Clone 1       95.2%               CYP2B1
               Clone 2       93.6%               Haptoglobulin mRNA partial alpha
21 (350)                     99.3%               18S, 5.8S & 28S rRNA

[...] spectrum of genes which are up-regulated in rat liver by phenobarbital, but
simply represents the genes sequenced and identified to date.



Table 3. Genes down-regulated in rat liver following 3-day exposure to phenobarbital.

Band number                  Highest sequence
(approximate size in bp)     similarity          FASTA-EMBL gene identification

1 (1500)                     95.3%               3-oxoacyl-CoA thiolase
2 (1200)                     92.[?]%             Hemopexin mRNA
3 (1000)                     91.7%               Alpha-2u-globulin mRNA
7 (700)        Clone 1       77.2%               M. musculus C1 inhibitor
               Clone 2       94.5%               Electron transfer flavoprotein
               Clone 3       91.0%               M. musculus Topoisomerase 1 (Topo 1)
8 (650)        Clone 1       86.9%               Soares 2NbMT M. musculus (EST)
               Clone 2       96.2%               Alpha-2u-globulin (s-type) mRNA
9 (600)        Clone 1       86.9%               Soares mouse NML M. musculus (EST)
               Clone 2       82.0%               Soares p3NMF19.5 M. musculus (EST)
10 (550)                     73.8%               Soares mouse NML M. musculus (EST)
11 (525)                     95.7%               NCI-CGAP-Pr1 H. sapiens (EST)
12 (375)                     100.0%              Ribosomal protein
13 (23)        Clone 1       97.[?]%             Soares mouse embryo NbME13.5 (EST)
               Clone 2       100.0%              Fibrinogen B-beta-chain
               Clone 3       100.0%              Apolipoprotein E gene
14 (170)                     96.0%               Soares p3NMF19.5 M. musculus (EST)
15 (140)                     97.3%               Stratagene mouse testis (EST)
Others: (300)                96.7%               R. norvegicus RASP 1 mRNA
        (275)                93.1%               Soares mouse mammary gland (EST)

[...] Bands 4-6 were shown to be false positives by dot blot analysis and,
[...] derived from Rockett et al. (1997). It should be noted that the above genes
[...] spectrum of genes which are down-regulated in rat liver by phenobarbital
but simply represents the genes sequenced and identified to date.



[...] primed PCR (Liang [...]) [...] referred to as 'differential
display' (DD). In this method, all the mRNA species in the control and treated cell
populations are amplified in separate reactions using reverse transcriptase-PCR. The
products are then run side-by-side on sequencing gels. Those
bands which are present in one display only, or which are much more intense in one
[...] may be recovered for [...] speed with which it can
be carried out: 2 days to obtain a display and as little as a week to make and identify
clones.

[...] methods of priming the [...] with a 2-base 'anchor'
at the 3'-end, e.g. [...] ([...] 1992). Alternatively, an
[...] primer may be used [...] (Welsh et al. 1992).
This variant of RNA fingerprinting [...] 'RAP' (RNA Arbitrarily
Primed) PCR. One advantage [...] PCR products may be
[...] frames. In addition, it [...] many bacterial mRNAs
(Wong and McClelland [...]). In both cases, following reverse transcription and
denaturation, second strand cDNA [...] with an arbitrary primer.
Arbitrary primers have a [...] compared to random
[...] composition). The resulting [...] on the system (primer
[...] usually includes 50-100 [...] combination of different
[...] species from a cell can [...] populations are analysed
[...] can be identified and [...] analysis.

[...] today for identifying [...] perceived disadvantages:

(1) [...] mRNAs (Bertioli et al. [...]) [...] the isolation of very
[...] circumstances (Guimeraes et al. [...]).

(2) [...] 3' end of the mRNA [...] not always be the case
(Guimeraes et al. 1995a). [...] included in Genbank and
shows variation between [...] identified by DD cannot always be
matched with their genes even [...] identified.

(3) The pattern of differential [...] display often cannot be
[...] in up to 70% of cases (Sun et al. 1994). Some adaptations have
[...] to reduce false positives, including the use of [...]
and Denman 1997), comparison of [...] course (Burn et al. 1994)
and comparison of DDPCR products from two uninduced and two induced
cell lines (Sompayrac et al. 1995). [...] reported that the use of
cytoplasmic RNA rather than total RNA [...] positives arising from
nuclear RNA that is not transported [...].

[...] weaknesses of the DD [...] et al. (1996) and from [...]




Figure 8. Two approaches to differential display (DD) analysis. 1st strand synthesis can be carried out
either with a polydT(n)NN primer (where N = G, C or A) or with an arbitrary primer. The use of
different combinations of G, C and A to anchor the first strand polydT primer enables the priming
of the majority of polyadenylated mRNAs. Arbitrary primers may hybridize at none, one or more
places along the length of the mRNA, allowing 1st strand cDNA synthesis to occur at none, one
or more points in the same gene. In both cases, 2nd strand synthesis is carried out with an arbitrary
primer. Since these arbitrary primers for the 2nd strand may also hybridize to the 1st strand cDNA
in a number of different places, several different 2nd strand products may be obtained from one
binding point of the 1st strand primer. Following 2nd strand synthesis, the original set of primers
is used to amplify the second strand products, with the result that numerous gene sequences are
amplified.
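
The anchored first strand priming described in the caption above can be illustrated
with a short Python sketch. It enumerates polydT primers carrying a two-base anchor
drawn from G, C and A, as the caption states, and works out which anchor would pair
with the two bases immediately upstream of a poly(A) tail. The polydT length of 11
and the example mRNA sequence are assumptions made for the illustration, not values
taken from the text.

```python
from itertools import product

# Complement table covering both RNA and DNA bases.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "U": "A", "T": "A"}


def anchored_primers(dt_length=11, anchor_bases="GCA", anchor_length=2):
    """List polydT primers (5'->3') ending in every anchor combination."""
    tail = "T" * dt_length
    return [tail + "".join(anchor)
            for anchor in product(anchor_bases, repeat=anchor_length)]


def anchor_for_mrna(mrna, anchor_length=2):
    """Return the primer anchor (5'->3') that pairs just upstream of the poly(A) tail.

    The anchor is the reverse complement of the last bases before the tail.
    """
    body = mrna.rstrip("A")            # remove the poly(A) tail
    upstream = body[-anchor_length:]   # bases adjacent to the tail
    return "".join(COMPLEMENT[base] for base in reversed(upstream))


primers = anchored_primers()
example_mrna = "GGCAUGAAAAAAA"         # hypothetical mRNA ending ...UG + poly(A)
anchor = anchor_for_mrna(example_mrna)

print(len(primers), "anchored primers")
print("anchor needed for example mRNA:", anchor)
```

With three permitted bases at each of two anchor positions the sketch yields nine
primers; using a different anchor alphabet would change that count.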

Restriction endonuclease-facilitated analysis of gene expression 
Serial Analysis of Gene Expression (SAGE) 

A more recent development in the field of differential display is SAGE analysis
(Velculescu et al. 1995). This method uses a different approach to those discussed so
far and is based on two principles. Firstly, in more than 95% of cases, short
nucleotide sequences ('tags') of only nine or 10 base pairs provide sufficient
information to identify their gene of origin. Secondly, concatenation (linking
together in a series) of these tags allows sequencing of multiple cDNAs within a
single clone. Figure 9 shows a schematic representation of the SAGE process. In this
procedure, double stranded cDNA from the test cells is synthesized with a
biotinylated polydT primer. Following digestion with a commonly cutting (4 bp
recognition sequence) restriction enzyme ('anchoring enzyme'), the 3' ends of the
cDNA population are captured with streptavidin beads. The captured population is
[...] each group. Incorporated [...] restriction enzyme, one
which cuts DNA at a defined distance (< 20 [...]) from its recognition sequence.
Hence, following digestion of each captured cDNA population with the IIS enzyme,
the adaptors plus a short piece of the captured cDNA are released. The two
populations are then ligated [...] amplified products are
[...] concatenated (concatomers are formed in
the process) and cloned. The advantage of this [...] hundreds of gene tags
can be identified by sequencing only [...]. Furthermore, the number of times
a given transcript is identified is a quantitative measurement of that gene's
abundance in the original population, [...] which facilitates identification of
differentially expressed [...] populations.

Some disadvantages of SAGE [...] technical difficulty of the
[...] is required, [...] biased towards abundant
mRNAs, has not been validated in the [...]nomic setting and has
not been used to examine [...] to date.
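
The two SAGE principles described above, a short tag that identifies its gene and a
tag count that measures abundance, can be sketched as follows. This is a simplified
illustration rather than the published protocol: it assumes the anchoring enzyme
recognizes CATG, takes the 10 bases immediately 3' of the last such site in each
cDNA as the tag, and uses invented transcript sequences. The ditag formation,
concatenation and sequencing steps are not modelled.

```python
from collections import Counter

ANCHOR_SITE = "CATG"   # assumed recognition site of the anchoring enzyme
TAG_LENGTH = 10        # tag length in bases (the text gives nine or 10)


def extract_tag(cdna):
    """Return the tag 3' of the last anchoring site, or None if unavailable."""
    position = cdna.rfind(ANCHOR_SITE)
    if position == -1:
        return None
    start = position + len(ANCHOR_SITE)
    tag = cdna[start:start + TAG_LENGTH]
    return tag if len(tag) == TAG_LENGTH else None


def count_tags(transcripts):
    """Count how often each tag occurs in a population of cDNA sequences."""
    tags = (extract_tag(sequence) for sequence in transcripts)
    return Counter(tag for tag in tags if tag is not None)


# Hypothetical transcripts, invented for the example.
TRANSCRIPT_A = "GGCATGAAACCCGGGTTTAAAA"
TRANSCRIPT_B = "TTCATGCCCATGGATTACAGATCCAAAA"

control_counts = count_tags([TRANSCRIPT_A, TRANSCRIPT_A, TRANSCRIPT_B])
treated_counts = count_tags([TRANSCRIPT_A, TRANSCRIPT_B, TRANSCRIPT_B, TRANSCRIPT_B])

for tag in sorted(set(control_counts) | set(treated_counts)):
    print(tag, "control:", control_counts[tag], "treated:", treated_counts[tag])
```

Comparing the two counters tag by tag gives the kind of quantitative readout the
text describes. A real analysis would also need to handle sequencing errors and tags
shared by more than one gene.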



Gene Expression Fingerprinting (GEF)

A different [...] approach for isolating differentially
[...] Ivanova and Belyavsky (1995). In this
method, RNA is converted to cDNA using [...] oligo(dT) primers. The
cDNA population is [...] and captured with
[...] microbeads to [...] unwanted 5' digestion
products. The use of restricted 3'-ends [...] to reduce the complexity of the
[...] pool and helps to ensure that each RNA species is represented by
not more than one restriction product. [...] to facilitate subsequent
amplification of the captured population. PCR is carried out with one
adaptor-specific and one biotinylated polydT primer. The reamplified population is
recaptured and the non-biotinylated strands removed by alkaline dissociation. The
non-biotinylated strand is then [...] different adaptor-specific
primer in the presence of [...] immobilized 3' cDNA
ends are next sequentially treated with a series of [...] restriction endonucleases
[...] products [...]. The result is a fingerprint
composed of a number of ladders [...] sequential digests used).
By comparing test versus control [...] to identify differentially
expressed products which can then be isolated from the gel and cloned. The
advantages of this procedure are [...] reproducible, and the
authors estimate that 80-93% of [...] are involved in the final
fingerprint. The disadvantage is that [...] can rarely resolve more
[...] the 1000 or more which are
estimated to be produced in an average [...]. The use of 2-D gels such as
those described by Uitterlinden et al. ([...]) and Hatada et al. (1991) may help to
overcome this problem.

[...] fragments was later
[...] instead of sequential
digestion of the immobilized [...], these authors simply
compared the profiles of the control and treated populations without further
manipulation.



Figure 9. Schematic representation of the SAGE process. Steps shown: 1st strand cDNA
synthesis using biotinylated polydT primers; cDNA cleaved with the anchoring enzyme
(AE) and captured with streptavidin beads; divide in half and ligate linkers; cleave
with tagging enzyme (TE) and produce blunt ends; ligate and amplify to give ditags;
cleave with AE, [...] ditags, concatenate, clone and sequence. [...] The cDNA pool is
divided in [...] linker, each [...] type IIS restriction site [...] (XXXXX and OOOOO
indicate nucleotides of different tags). The two pools of tags are then [...]

DNA arrays



[...] that it takes a great deal [...] that they are indeed [...] tissue. Normally, the
[...] PCR. Even so, each of the aforementioned steps produces [...] goal of rapid
analysis of gene expression. [...] the development of
so-called DNA arrays (e.g. [...], Zhao et al. 1995, Schena et al. 1996),
the introduction of which has signalled [...] differential gene expression
analysis. DNA arrays [...] 'chips' containing
hundreds or thousands of DNA spots [...] multiple copies of part of
a known gene. The genes are often selected [...] previously proven involvement
in oncogenesis, cell cycling, DNA [...] other cellular processes.
They are usually chosen [...] gene and animal species.
Human and mouse arrays [...] available, and a few companies
will construct a personalized array to [...] Clontech Laboratories and
Research Genetics Inc. [...] hundreds or even thousands
of genes can be spotted on a [...] mRNA/cDNA from the test
populations can be labelled and used [...]. When analysed with
appropriate hardware and software [...] rapid and quantitative means to
assess differences in gene expression [...] populations. Of course, there
can only be identified [...] which are in the array
(hence the term 'closed' [...]). [...] approach to elucidating the
molecular mechanisms involved in a [...] development system may be
to combine an open and closed system [...] to directly identify and
quantitate the expression of known genes in mRNA populations, and an open
system such as SSH to isolate unknown genes which are differentially expressed.

One of the main advantages of DNA arrays is the [...] number of gene fragments
which can be put on a membrane [...] companies have reported gridding up to
60000 spots on a single glass [...] slide). These high density chip-based
micro-arrays will probably [...] mass produced off-the-shelf
items in the near future. [...] rapid determination of
differential expression in time and [...]. Aside from their
high cost and the technical complexities [...] and probing DNA
arrays, the main problem which remains, especially with the newer micro-array
(gene-chip) technology, is that results [...] reproducible between
arrays. However, this problem is being addressed and should be resolved within the
next few years.
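
As a rough illustration of how array data of this kind might be compared between two
cell populations, the Python sketch below converts paired spot intensities into log2
ratios and flags spots that change by at least a chosen factor. It is a minimal
sketch under stated assumptions: the spot names and intensities are invented, the
two-fold threshold and the intensity floor are arbitrary choices, and no
normalization or replicate handling is attempted.

```python
import math


def log2_ratios(test, control, floor=1.0):
    """Return log2(test/control) for every spot present in both data sets.

    Intensities below `floor` are raised to it to avoid division by zero.
    """
    return {
        spot: math.log2(max(test[spot], floor) / max(control[spot], floor))
        for spot in control
        if spot in test
    }


def flag_changed(ratios, threshold=1.0):
    """Label spots whose absolute log2 ratio reaches the threshold."""
    return {
        spot: ("up" if ratio >= threshold else "down")
        for spot, ratio in ratios.items()
        if abs(ratio) >= threshold
    }


# Hypothetical spot intensities, invented for the example.
CONTROL = {"spot_01": 200.0, "spot_02": 800.0, "spot_03": 400.0}
TEST = {"spot_01": 1600.0, "spot_02": 100.0, "spot_03": 420.0}

ratios = log2_ratios(TEST, CONTROL)
flagged = flag_changed(ratios)

for spot in sorted(ratios):
    print(spot, f"{ratios[spot]:+.2f}", flagged.get(spot, "unchanged"))
```

Because only spots present on the array can be scored, the sketch also reflects the
'closed' nature of the system noted in the text.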



[...] expressed genes

[...] clones obtained from [...] identity (putative [...]
[...] a rapid and efficient [...] profiles of gene [...]
Adams et al. (1991), [...] estimated that there are [...] representing over half
of all human genes (Hillier et al. 1996). This large number of freely available
sequences (both sequence information and clones are normally available royalty-free
from the originators) has enabled the development of a new approach towards
differential gene expression analysis as described by Vasmatzis et al. (1998). The
[...] EST databases are first searched for genes that have a
[...] EST sequences from the target tissue of choice, but none or few
from non-target tissue libraries. Programmes to assist in the assembly of such sets of
[...] obtained privately or from the internet. For example,
the Institute for Genomic Research (TIGR, found at
[...]) provides many tools free of charge to the scientific
community. [...] amongst these is the TIGR assembler (Sutton et al. 1995), a
[...] of overlapping data such as ESTs, bacterial
[...] genomes. Candidate EST clones
[...] are then analysed using RNA blot methods [...] and tissue
[...] to isolate and identify the full length
cDNA clone for further characterization. In practice however, the method is rather
more involved, requiring bioinformatic and computer analysis coupled with
[...]. Vasmatzis et al. (1998) have described several
[...] approach, such as separating highly homologous
sequences [...] different genes and an overemphasis or specificity for some
EST sequences. However, since these problems will largely be addressed by the
development of more suitable computer algorithms and an increased
[...], it is likely that this approach to identifying differentially
expressed genes may enjoy more patronage in the future.
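
The database search described above, looking for genes with several ESTs from the
target tissue but none or few from other tissue libraries, can be sketched as a
simple counting filter. This is an illustration under stated assumptions, not the
published procedure: the EST records are invented (gene cluster, tissue) pairs, the
thresholds are arbitrary, and the clustering of ESTs into genes is taken as given.

```python
from collections import Counter


def candidate_genes(est_records, target_tissue, min_target=3, max_other=1):
    """Return gene clusters enriched in the target tissue library.

    est_records is an iterable of (gene_cluster, tissue) pairs. A cluster is kept
    when it has at least `min_target` ESTs from the target tissue and at most
    `max_other` ESTs from all other tissues combined.
    """
    target_counts, other_counts = Counter(), Counter()
    for gene, tissue in est_records:
        if tissue == target_tissue:
            target_counts[gene] += 1
        else:
            other_counts[gene] += 1
    return sorted(
        gene
        for gene, count in target_counts.items()
        if count >= min_target and other_counts[gene] <= max_other
    )


# Hypothetical EST records, invented for the example.
EST_RECORDS = (
    [("cluster_A", "target_tissue")] * 4
    + [("cluster_B", "target_tissue")] * 3
    + [("cluster_B", "other_tissue")] * 5
    + [("cluster_C", "target_tissue")] * 1
)

candidates = candidate_genes(EST_RECORDS, "target_tissue")
print(candidates)
```

Candidates found this way would still need confirmation in the laboratory, as the
text notes, since library depth and sequence similarity between genes can both
distort simple counts.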



Problems and potential of differential expression techniques

The holistic or single cell approach?

When working with in vivo models of differential expression, one of the first
issues to consider must be the presence of multiple cell types in any given specimen.
For example, a liver sample is likely to contain not only hepatocytes, but also
(potentially) Ito cells, bile ductule cells, endothelial cells, various immune cells (e.g.
lymphocytes, macrophages and Kupffer cells) and fibroblasts. Other tissues [...]
their own distinctive cell populations. Also, in the case of neoplastic tissue,
[...] normal, hyperplastic [...] dysplastic cells present in a
sample. One must, therefore, be aware that genes obtained from a differential
display experiment performed on an animal tissue model may not necessarily arise
exclusively from the intended 'target' cells, e.g. hepatocytes/neoplastic cells. If
[...] using immunohistochemistry, in situ hybridization or
[...] could be used to confirm which cell types are expressing the
[...] interest. This problem is probably most acute for those [...]
differential expression of genes in the development of different cell types, where
there is a need to examine homologous cell populations. The problem
[...] Cancer Institute (Bethesda, MD, USA) where [...]
dissection techniques have been employed to assist in their gene analysis programme,
the Cancer Genome Anatomy Project (CGAP) (for more information see web site:
http://www.ncbi.nlm.nih.gov/ncicgap/intro.html). There are also separation
techniques available that utilise cell-specific antigens as a means to isolate target cells
[...] 1998, Kas-Deelen et [...] Rogler et al. 1998).

[...] this issue unimportant, [...] showing altered expression
within a compromised tissue [...] population. After all, since all
tissues are complex mixes of different [...] cell types which intimately
[...], each cell type could in [...] way contribute (positively or negatively)
to the molecular mechanisms which lie behind responses to external stimuli or
[...] growth. It is perhaps [...] experiments using in vivo as
[...] identical cells probably [...] molecular changes that [...]

[...] natural biological variation [...] animal models are being used. It
is clear that individuals (human [...]) respond [...]
stimuli. One of the best characterized examples [...] oxidation
polymorphism, which is [...]
should, therefore, [...]
value of pooling starting [...]. [...] effect of this can be
beneficial through the ironing out of [...] unimportant minor
fluctuations of (mechanistically) [...] individual animals, thus
providing a clearer overall picture of the molecular mechanisms of the
response. However, at the [...] may be of utmost
importance in deciding the ability of individual animals to succumb to or resist the
effects of a given chemical/disease.

[...] a high percentage of [...]

[...] suggesting that mammalian [...]
(Mechler and Rabbitts 1981, Hedrick et al. 1984, [...] 1990), although figures as
high as 20-30000 have also been quoted ([...] et al. 1976). Hedrick et al. (1984)
provided evidence suggesting that the majority [...] to the rare abundance
class. A breakdown of this abundance [...] in table 1.

[...] the results of differential [...] been compared with
data obtained previously using other [...] not all differentially
expressed mRNAs are represented [...]. In particular, rare messages
(which, importantly, often include regulatory [...]) are not easily recovered
[...] differential display systems. [...] as the majority of
[...] mRNA population (table 1).
[...] (heterogeneous [...] able to detect mRNA
species present at less than [...] of the total mRNA population, equivalent to an
[...] or abundant species. Interestingly, when simple model [...]
[...] of the heterogeneous mRNA population, the
[...] level of detection of target RNA [...] 10000 x smaller. These results
[...] by competition for substrates from the many PCR
products produced in a DD reaction.

vJZHUSE ° f dUIeren " aU ^ ' «P^«"d mRNA, reported ,n the hterarure usmg 
mRslT^J, P T d f fUrther <VidCnce that ™ n ? d'ff«en«allv expressed 
lh„;t " c ° v ««i- For example. DeRisi et al. (19971 used DS\ arrav 

m^ ^fT: gCne eXPmSt ° n * yWt «*««*«» of sugar ,nTe 

ofle" 8 0^1 5 0^ i ' " U ° U,d not be "Enable to suggest that 

up » 5SoT or 21 ^ ^ " Ptd " Pf ° du " d by * ,ven mammalian cell. 

^ LTkIT ,h ° W ah r d eXpre " i0n fo " OW,n 8 chem ' cal ««n,ula«on. 
"mist this may be an extreme figure, it is known that at least 100 eenes are 

eU, diffelt'T " " * (, " 6) " dmmd *" 'n"rferon-v. s „l,. led H eU 
exLse^ bv rf y " P * **" (a " Umin ' 24000 mRNAs 

Im^hl! y ! H ° WeVer - there have been ,ew P"blicat,ons documenting 

^l l7" feCOVer> ' ° f the$C numbers - For ««"P ,e - in »«f DD to comparl 
totaTbLl t T n ^ tmg "ST Uvef - BaUCT " aL (1993 > found 70 of7 8 <55 
diSr^t , ° f 5 °°'' (3S •>—» to cormpondto 

7 ' V " f ° llow,n « ""adiol treatment. McKenz.e and Drak (tw£ 

2sss Le^ e ; e P n v L r: r ucw whose expression ~ au ~ d 

mvITo™ * tumour Promoter agent) stimulation of a human 

^odtT"^ Ce " lme - KUt> " and Vicke " (,997) ,den « fie ° »0 different^ 
al / eXP r feSS, ° n u P^ u,ated in the penpheral blood leukoc^Tof 
It * d,sease sufferers. Lmskens er al. (1995) found 23 genes dmSZlk 
expressed between young and senescent fibroblasts. Techniques other than DD
have also provided an apparent paucrv of different.allv expressed genes U^ e SH 

™mParedt: - " 15 differen " aI '- ^ 

cancer compared to normal mucosal epithelium. Fitzpatnck « al. (1995) isolated 17 

c7ofib»te re phT ^ ,n T/roon 1 ' 0 " 0 ^ trMmient W " h the P^oxJome p^or 
h °HK ^, PS " nW0> ,SOiated 12 cDNA cion " ^ ^re uore«uJatTd m 
n.gnlv metastanc mammary aaenocorcmoma ceil hnes compared to pooriTmet. 

"endfieT a;^" and , W T man (1 " 6) "" d 3 ' re "" C »°" 

:t= n =tce,:° 

to snow'IlST 0 ^' SSH W " U " d l ° "° ,ate Up lo 70 candida « 8cne, which appear 
to show altered expression in guinea pig liver following short-term treatment with
the peroxisome proliferator, WY-14,643 (Rockett, Swales, Esdaile and Gibson,
unpublished observations). However, these findings have still to be confirmed by

andloe^^ 

Tlh^STZ modificano^ w overcome ih,^ . Efficiency (in both the toS 
number of d,fferentidly expressed gene, recovered and the percentage that „e^e 
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-wastes- *'^sLsr5»- 

least in the cue 0 f SSH* Su ^ tra «'«>n kit user manual) It L*J1 1 degree 

Of course, probably th# m «™P*«y 

one point « tune. It may well be Jhil Can ° n,v ™errogate a eel 

espress,on at that time are nK M ;..j Vi Percentage of the genes showing j 

„ the „,„„,, of ~'»^ >■".« »un. which, of coune,* 
wao« expr.,,,™ i, ctu ^ rf °» B ie>. AJ though <ht UoLtion 0 f 
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ribed to some degree 
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ily interrogate a cell at 
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iscades of signalling, 
ies which are switched 
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ssue of how large the 
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:he isolation of genes 
reported using SSH 
mstranng a change in 
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D ^ on ? e ° ther « «« » ^ 
-fo.d o?:or^^ti7 D 6 ^ po " ed that d,n " eren " , ,n expresi,on * 

*» * could be 

given £t wJmZ " SR o'r DvL'^V^™* ahered Kpr " s '°" «" ™ 

used as standard in DD «r~« m «„. c 0 tiambrook « a'- 1 989) and are 

product, such »*o£,7^^ 

wha«appe««>beo^SLg^v L Unre$ ° lvable «»»PO"~ Thus, 
been well documented (Math" u-DauJe « S 1 J? Wh T^l!"** " hM 
band extracted from a DD often renr™ ' " h " aL 1 997) th " a « n gle 

and the same fouTd^SSH d ' , COmP ° $,te ° f h « ero *-«»» P">du«,. 

1997). One possiblHolu ™n w« tffwed'bx M "k * i 4 * 0 ™ 0 ^' « 
extracted and reclined ^«S r ^^^' ude ;' (,996 >' wh ° 
conformation polymorphism <SSc£ anaK™ I » " d u " d " n « ,e 

represented the truly differently expressed product C ° mp ° nen « 

Many scientists often try to avoid the use of P ir.F ui t 

technically more demanding than ^JZZ i V P °"' ble becau * e " '» 
h.gh resolution »wJ^^TZS^S^- tAG ^ L ' n{or ^^ 
HR (National DUgnosr! He«£ UK?t£ C -* L ' Chhe,d - L K) and A « uaP ° r 
than PAGE, can onK* «o«te D V A M Pfepare 3nd m ™P«»«e 

l-W (15-20 b»e L ^fo?, 1Kb f SCquen "%*' h ' ch **er in s«e by around 

producu which diff r?n?^L , s?:^ 0 !^' ^ *° A ° F » Uch 
However, a s.mple technic. £ , m ° Um arc norma »V not resolvable. 

(b,sbenzam,de.PEG ^HH^^ T? '1*'™° U *> nd) or HA-yellow 
gel separate, J^^l^Z^^T^" 10 " 1 ' « 
HA.red and -yellow select^S* ScS It S^""™ Sptdfiea,,y - 
■^awer « a/. 1995. Hanse Anaivnk log- DNA moms - re *P«ctively 

HA-stam, p OMe „ an o^U^S- commun *"»<"»- =«ce both 

when an electnc fiS, apohed ^ """'J *« 
» neganvelv charged and XreforT^ " ° PP ° Sm ° n 10 DNA - wh ' ch 

DNA clones are fdenncai S Je 7. T"^ the anode - Th '»- if w <» 

agarose gel), but differ m \T/GC P*™™* ™ * "andard h.gh re«,lution 

will effectiv ^ t^Z^^T^-T^ ° f 3 HA " dye ^ the * 
other, effectively „k£\ TZgZll ^—es compared to the 

differentiahng between the two Th« T H A.red has ^ * ° f 

seouences with an ax /, riA-red has been shown to resolve 

to distinguish nTS^utS^i^ ^ CW " W " 

(Hanse An«Iy«ik 1996 personTcoil ^ °" ly " S,n8 ' e point mulation 

whether all the cloned ^SS2^^E^^^ foW • tf ^ ^ w ^ 
-«pe«ment^eder^ f?om ^ band fa 8 diff ««rial display 

g can be run on a standard h.gh resolut.on gel. and a second aliquot 
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«D«r,mem,„ d cloned. S«-en eolon^ w„e '? ck L »ubtr,«,v, hvbnd«»2 

iK'^^^^'^^-EE: T m eloned "* 

gei. and (B> a high resolution 2 °. u., Ate ..i * geil * * Ai a hl » h fwoiunon 2 

*el BV wh.ch .ep.r,. n .dentici!lv.,i led DvVL^'^ w " owever - *« Presence of HA-red 
«mmpl«. .ho^ ^, ^ „.^,K?S^ a ^£. , T; , « ene ,pee " -*» each band. 



•n a similar gel containing one of the HA ■ -r'u 

any gross s.ze difference!. I V?*"* $h ° U,d indic «« 

-resolvable spec.es (on «MdIri\cE) ' k ^ ° thenvi » e 

« «/. (1997) reported successful us/oT^ * W bl4e conte «- linger 

clones. F.gure 10 shows sucnt exlerlenr c?'' H ^ *" ^""^ D D-deri^d 
obta,ned from a band ^^^^ » *» »°°™rv on Cone, 

An alternative approach is to cam- out a - D? ' 
orooucts. in tru, approach. s,ze -based seoarlrL " *Phr 

aearose gel. The gel sl.ce contammg the d«ofa I Came ° ° Ut «" 3 Jttnda ^ 
» «• a HA gel for resolunon base7on AT/GC " ^ '""grated 

^p C e° c ^cn I^ne st™t * differ, 

even these specie, are not « abl ^^ r ° C/AT 
SSCP. or perhaps, denaturing gradien X.t , v effo "- a S»'n. one might we 
gradient field electrophoresis ^GE) ap l^ 

-her directly on tne extr>ct < d ^S^»J^*~ -tent, of. b«d. 

product. U * UKI " a/ * 199 U or on the reamplified 

"""2^3^ to v„ ualiM large 

of numbers, the resolution o tTgE ra~ K - - ' 3 Pr ° blem in lh "' « ^rrn, 
overcommg this might ^o^^^' 0 ^ 00 b " d *- One approach to 
ol. (1989) and Hatada « al <199i) SUCtrmhWrt " cri oed bv Uitterlinden et 
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lA-rcd. Bandi of decreasing 
i subtnctive hybridization 
each cloned band ano rheir 
high resolution 2 ° 0 agarose 
*ed. With tew exceptions, all 
• er. the presence of HA-red 
he percentage of CC within 
:ies within each band. For 
>e the same sue. at least four 



rd gel should indicate 
Id separate otherwise 
ase content. Geismger 
entifving DD-denved 
:s laboratory on clones 

he differential display 
"^ec out in a standard 
rtec and incorporated 

there being different 
\T content. However, 
-again, one might use 
»GGE i or temperature 
he contents of a band t 
or on the reamplified 

jes to visualize large 
3blem in that, in terms 
ands. One approach to 
bed by Uitterlinden et 



some eases (e.g. DD. GEF). the results are visualized bv autoradiography means 

^ZZt^Tt CI-*. • ™^ extracnon can 

hTZJ^ST* ^J""™ lon ; Th» problem, and that or the u,e of rad,o,sotopes. 
Slo^jlr ed , by ^rOUp, • F ° r ^P^.-Lohmann er al. 

i™^pi?* t J llV ' T r mmt * U,ed directl >- » DD band, » 

horizontal PACjuA. > « a/. (1996) .voided the «e of radio.sotope, bv transferring 

visualizing the bands using chemiluminescent staining before going back to extract
the remaining DNA from the gel. Chen and Peck (1996) went one step further and
transferred the entire DD to a nylon membrane. The DNA bands were then

onm T 8 I d ^ xi « enin (D,G > *«« (DIG w„ aaached to the pi. dT 
pnmer, used m the different* di,pl.y procedure). Differential* expressed bands 

To^^^ 

One of the advantages of techniques such as SSH and RDA is that the final
display can be run on an agarose gel and the bands visualized with simple ethidium

«T£iRG m *- ?*\^ a r oach - prov,de 2E222S 

The Tl 1 ° r " ° 0ld nude,C acid »*» < FMC > ^««velv enhance! 

^ ,0mefamt P"* 1 "^ *" may otherwise be overlooked. Whilst 

wavelenS tZllt * mCd,Um —«'»«* (306 nm). the shorter 

rLSS^ri - extracted under 2S4nm irradiation, effectively preventing 
reamplification and cloning. The best approach is to overstain with SYBR (S
and extract bands under a medium wavelength UV transillumination.

The possible use of 'microfingerprinting' to reduce complexity

Given the number of gene products and the possible complexity of each
band, an alternative approach to rapid characterization may be to use an enhanced
analysis of a small section of a differential display, a 'sub-fingerprint' or 'micro-
fingerprint'. In this case, one could concentrate on those bands which only appear
in a particular chosen size region. Reducing the fingerprint in this way has
two advantages. One is that it should be possible to use different gel types,
concentrations and run times tailored exactly to that region. Currently, one

^:lZt m ^T' ^^^^^^^^ 

SL k g , d e0Me, » uenll y to «ubopt,mal resolution, both in terms of 

size and numbers, and can lead to problems in the accurate excision of individual
bands. Secondly, it may be possible to enhance resolution by using a 2-D approach
using a HA-stain, as described earlier. In summary, if a range of gene products is
..c^Wly chosen to mcludedce^^^ 

^ z°zz x ; r e r a,ysis r d - k may be * d ^ • ^ 22 

"MM?LftZ? denuficauon of compound, which have s.milar or widely different 
cellular effects. If the prognosis for exposure » one or more other chemicals which 

eff?«for^ 

effects for any new compounds which show a similar micro-fingerprint.
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reaction ^5^ ^ ^ / "'^J? * PCR P " men 

genes. c>.ochro m « ^olTd"^" ** reCept ° r$ - ce » 

for analysis « ^£,7^ ^PI 0 "™™^ be cons.dered a, candidate, 

damage response etc. ™nerent responses e.g. apoptosis. stress. DXA- 



Screening 

False positives 



1994. Sompavrac « «/ ioo'cJ tv! ! ' , ' N " h «°««'- 1994. Sunrta/. 

been HPLC purified can lead ^t?™/ r, UM of ada P t0 " whic *> have not 
ligation even* (O^.^^ 

^"tifacuandiHegi^ " »~ *«-* 

to be derived lareelv from a k.,,^r A In SH> faIse P°«tiv« appear 

* 1984. Sakaguchi 

tester probe. hSSt dnVe ^ e f^T"^ clon « hybridize «o 

may not generate de ecubTe hTtrid £ f ^ 3PPr ° aCh " that rare 

» » screen the cloning a labetd ^ ?*" *" th0 " SSH 

from which it w„ derived and Z 1 f' 0 ^"'? 6 ? fro "> *e subtracted cDNA 

it should be possible » confinr rJi. „T ^ '""^ rMe 

«nes. Despite this quid ^2^^^ ^„^^^ ,Wab «^ 

Poor by today', high standards and one rnu.t relv on PCR me^d,?' " 
sensitive determinations (see below). methods tor accurate and 

Sequence analysis 

Im^wKd foS^^ PrOC l dUreS Produ " final P rod "- which are 
the sequent Tor ZyZsoZ mZ^JT^ "*« ^ * 
confidence in the result-severJ I Wi r tUm Uads to a redu ^ed 

^-esare-.^^ WA 
P450genesuperfamily(Nekon« flM W6; J ttre ? he «- e « the cytochrome 
almost identical to gene I^i^ *~ *" d ° ne identified » 

« yet -discover^ *«* *: o, li 

~* ror example, using SSH, part of a gene was isolated. 



mine altered expression 
R primers and 'or post- 
■ receptors, cell cycling 
onsidered as candidates 
arrays (e.g. Clontech's 
this to some degree by 
:poptosis. stress. DNA- 



ar length amongst the 
no etal. 1994. Sun etal. 
stives varies wuh the 
iaptors which have not 
ves through illegitimate 
they can arise through 
I. false positives appear 
t some may arise from 
: for technical reasons, 
ones can be carried out 
md probes synthesized 
said clones ( Hedrick et 
iones will hybridize to 
ach is that rare species 
>n for those using SSH 
: the subtracted cDNA 
he reverse subtraction 
inches rare sequences, 
senting low abundance 
ieed :o eo back to the 
; 3 more quantitative 
riots. :ne sensmviry is 
::hods ior accurate and 



(2) 



Differential gene expression 

h^tr^TT"'"!."^ *■ I,Vef ° { ™ tXpOKd t0 W >-'^3 and wa, .dennried 
by a FASTA search as being transferrin (data not shown). However, transferrin is

^ b> ' ******* P"*«» P-Hferators sueh . 

!!,K X " 1 96)> ind thtt was ""finned with subsequent RT-PCR 
Je£j ^ $ th " thC ^ " qUenee » ol " ed m - » ■ e*ne which 

t fun^'orlM traMfemn - but » by , different mechan.sm. 

befor/sH . * a " OC,ated w,lh SH "chnology ,, redundancv. l„ mo$t c „ e , 

df«st,on ?K Cd *" C ? NA P ° PUlati ° n mU " *™ be bv^est 
digestion. This is important for at least two reasons:

formation of appropriate hybrids, especially at the high 
concentrations required for efficient hybridization 

.^dla?^^ 10 ^ n fragmentl PTOV,deS be " er "P—,t,on of 
™w « ene, \ Th, « » »«>« derived from related but distinct 

££££ IT^T 1 "" of T have s,milar codin * seouen «* *« 

hybridize and be eliminated during the subtraction procedure (Ko 1990) 

or ""Potion and. thus, may not emciemlv do one 

expressed cDV^f ^V"^ ««. fragments from differentially 
expressed cDNAs may be eliminated during subtracts hvbndization pro. 

consequence of this, some genes will be cut one or more times, giving rise to two
or more fragments of different sizes. If those same genes are differentially
expre^d. th „ mofe of ^ ^ ^ ^J£J»> 

redundancy and increasing the number of redundant sequencing reactions. 

Sequence comparisons also throw up another important point: at what degree
of sequence similarity does one accept a result? Is 90% identity between a gene
denved from your model spec.es and another acceptab v closeM 9™?be,we^ 
your sequence and one from the same species also acceptable? This problem is
particularly relevant when the forward and reverse sequence comparisons give

with comp,ete,y different ?ene ^ An izsz 

seems to be to allocate genes that are definite (95% and above similarity), and then
group those between 60 and 95% as being related or possible homologues.
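
The binning rule described in this paragraph can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the category labels and both cut-off defaults are hypothetical choices that a user would set for their own model species, and the percentage identity is assumed to have been computed already by a database search.

```python
def classify_match(percent_identity, definite_cutoff=95.0, related_cutoff=60.0):
    """Bin a database hit by percentage identity.

    Both cut-offs are assumed defaults for illustration, not fixed rules.
    """
    if not 0.0 <= percent_identity <= 100.0:
        raise ValueError("percent identity must lie between 0 and 100")
    if percent_identity >= definite_cutoff:
        return "definite"
    if percent_identity >= related_cutoff:
        return "related or possible homologue"
    return "unassigned"


# Hypothetical hits from a sequence database search (clone name -> % identity).
hits = {"clone_01": 98.2, "clone_02": 87.5, "clone_03": 41.0}
calls = {name: classify_match(pid) for name, pid in hits.items()}
```

Keeping the cut-offs as parameters makes it easy to apply a stricter rule to same-species comparisons than to cross-species ones.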



nal products which are 
rabiy reduce the si2e of 
-im leads to a reduced 
nembers whose DN'A 
s. e-g- the cytochrome 
one identified as being 
brother gene X, or its 
of a gene was isolated, 



Quantitative analysis 

At some point, one must give consideration to the quantitative analysis of the
candidate genes, either as a means of confirming that they are truly differentially
expressed, or in order to establish just what the differences are. Northern blot
analysis is a popular approach as it is relatively easy and quick to perform. However,
to^t « ^ W,lh c Northe ™ blo « » ^ - often not sensitive enTgh' 

■ -iS2T7 J" , 6 1 } " lh . i$ " * ma, ° r Pr ° b,em - Co ""^«tly. RT-PCR may t Z 
method of choice for confirming differential expression. Although the procedure is
somewhat more complex than Northern analysis, requiring synthesis of primers and
optimization of reaction conditions for each gene species, it is now possible to set up
high throughput PCR systems using multichannel pipettes, 96+-well plates and
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money „„ rime SSTTLST " ' d " ,,e * ° n " mtCnul *• 
especially when one mjh; ^ w" 101 " 11 ^ " 

must fir* of all choo, e an ^^1" ° n < 

compared to the control* " " 0t cnan * e sn the te » cells 

example fa-ribcn^^S^iST?:^^ l " ed the P a »- *>' 

hydrofolate reductase (DHFR. Mohle'r £d b££^w> I'V i J 9 ,^* ^- 

* »998T and . num*r 0 ^^^^ tfanS,era " ,HPRT - « 

standard should not ^^ ( ^' d, ? W »r >W). Ideally, an mtemil 

stage in the cell cvcle wdSouri, *T .ff J^T" 00 m the re «" d '«» of eell age. 
shown on numerous £22t^ f f 1 - 1 " imu,i - H °*«ver. it ha, been 
-ed by the ^-rtitSSSl^ ' f m °" h ° u " k «P-«ene, currently 
different tiuue. (Clon^**^" ""^ " min COnditiw » — « 

Preliminary experiments should therefore be carried out on housekeeping genes to
establish their suitability for use in the model system.
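
One way to run such a preliminary check is to measure each candidate internal standard across the conditions of the model system and keep only those whose signal barely moves. The sketch below is a minimal illustration, not a prescribed protocol: the gene labels, the signal values and the 10% coefficient-of-variation cut-off are all assumptions chosen for the example.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean."""
    return pstdev(values) / mean(values)


def stable_standards(panel, max_cv=0.10):
    """Return candidate standards whose signal varies by no more than max_cv."""
    return sorted(
        gene for gene, values in panel.items()
        if coefficient_of_variation(values) <= max_cv
    )


def normalise(target_signal, standard_signal):
    """Express a target gene signal relative to the chosen internal standard."""
    return target_signal / standard_signal


# Hypothetical signals for two candidate standards across four conditions.
panel = {
    "standard_1": [100.0, 102.0, 98.0, 101.0],
    "standard_2": [100.0, 160.0, 70.0, 210.0],
}
stable = stable_standards(panel)
relative_level = normalise(250.0, 100.0)
```

Only a standard that passes this kind of check should be used as the denominator when comparing test and control samples.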

Interpretation of quantitative data must also be treated with caution. By
comparing the lists of genes identified by differential expression one can perhaps

For example, rat, and mice appear «nsi ive » th^ 5 W,y » ? "J*™ 1 
range of peroxisome proliferate JZ£^^™*T K ? C *"* ° f * 
resistant (Orton « a/ 1984 Sri*. ^ hanwers » d «™nea pig, are largely 
Makowska „ a/. 1992) " ^Li2b&?. u^"" 198? ' Uke " * I9 «. W, 
compare list, of up- i^^ST* " *" wl * » » 

expressed in only one ^ ***** ^ Which 

rhes,idgene.mightsugUam«1;^ 

or protecnon. Of course, the situation ,« l J , a " dnon - genorox,cc arcmogenesu 
^^«on.k*i^;£^^^» be far ™" comply Perhaps if 
upregulated 50 time, bv P Ps m e S ^,T P * ™ non -*«">»»»c effect* and it wa, 
» the rat. However. -L^t^KE*" ^ *• time, 

gene may be overlooked Just « ir! u P re « uia »d. the .moonance of the 

does not necessariK met ISZ^ ' '"T ^ » 
true relevance of gene Y w h ,ch no", " 5 TSd ^ F ° r CSamp,C ' wh " » *• 
and gene Z which shows onlv a 5 " d in^"^ ^ ' PZ ^" 
may find that historically, gene Y has oft! n K v. the literature one 

fold by a number of iuJ^^^I^ST * be "P*^" 1 "^ *0-60- 
■IWl««intficw1to^TS^ hBht ° 7 *» 1116 50 - fold «crea« would 
recorded a, Sving ^^^£^2^^ ^ 2 1« »mr l»« 

increaae all the more «d^l2^^2riT^ h ^ ^ 5 - fo,d 

increase has only been seen in rIl«.H r - _ «ter«tmg » if that same S-fold 
chemicals. . ° m ' elated ne ° pnums or follo -«8 treatment with related 

-Prtbrem, S Tag tM-«IM^ard MpE y .pproac h— 
••-S^JSS,,^ ^ an easily obuinable 

« developmental proc^o^^^^ 



arivc analysis is more 
i internal standard, the 
•uJe is often excessive, 
eds of gene species. The 
datively involved. One 
change in the test cells 
een tried in the past, for 
-in (Heuval et al. 1994). 
Vong et al, 1994), di- 
^-microglobulin (£-2- 
sferase (HPRT, Foss et 
b). Ideally, an internal 
II regardless of cell age, 
li. However, it has been 
keeping genes currently 
srtain conditions and in 
-e ( therefore, that pre- 
iping genes to establish 

ated with caution. By 
ession one can perhaps 
-vays to external stimuli, 
lotoxic effects of a wide 
d guinea pigs are largely 
Lake et al. 1989, 1993. 
the reason(s) why is to 
dentify those which are 
ledge of the effects of 
enotoxic carcinogenesis 
>re complex. Perhaps if 
otoxic effects and it was 
up-reguiated five times 
i. :he importance of the 
;<r cnanee m exoression 
or exampie, what is the 
■ a particular treatment, 
"nines the literature one 
be up-regulated 40^60- 
50- fold increase would 
it gene Z has never been 
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Abstract 



Recent progress in genomics and proteomics technologies has created a unique opportunity to significantly impact
the pharmaceutical drug development processes. The perception that cells and whole organisms express specific
inducible responses to stimuli such as drug treatment implies that unique expression patterns, molecular fingerprints
indicative of a drug's efficacy and potential toxicity, are accessible. The integration into state-of-the-art toxicology of
assays allowing one to profile treatment-related changes in gene expression patterns promises new insights into
mechanisms of drug action and toxicity. The benefits will be improved lead selection, and optimized monitoring of
drug efficacy and safety in preclinical and clinical studies based on biologically relevant tissue and surrogate markers.
© 2000 Elsevier Science Ireland Ltd. All rights reserved.

Keywords: Proteomics; Genomics; Toxicology



1. Introduction 

The majority of drugs act by binding to protein 
targets, most to known proteins representing en- 
zymes, receptors and channels, resulting in effects 
such as enzyme inhibition and impairment of
signal transduction. The treatment-induced per- 
turbations provoke feedback reactions aiming to 
compensate for the stimulus, which almost always 
are associated with signals to the nucleus, result- 
ing in altered gene expression. Such gene expres- 
sion regulations account for both the 
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pharmacological action and the toxicity of a drug 
and can be visualized by either global mRNA or 
global protein expression profiling. Hence, for 
each individual drug, a characteristic gene regula- 
tion pattern, its molecular fingerprint, exists 
which bears valuable information on its mode of
action and its mechanism of toxicity.

Gene expression is a multistep process that
results in an active protein (Fig. 1). There exist
numerous regulation systems that exert control at
and after the transcription and the translation
step. Genomics, by definition, encompasses the
quantitative analysis of transcripts at the mRNA
level, while the aim of proteomics is to quantify
gene expression further down-stream, creating a
snapshot of gene regulation closer to ultimate cell
function control.
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2. Global mRNA profiling 

Expression data at the mRNA level can be
produced using a set of different technologies
such as DNA microarrays, reverse transcript
imaging, amplified fragment length polymorphism
(AFLP), serial analysis of gene expression
(SAGE) and others. Currently, DNA microarrays
are very popular and promise a great potential.
On a typical array, each gene of interest is repre-
sented either by a long DNA fragment (200-2400
bp) typically generated by polymerase chain reac-
tion (PCR) and spotted on a suitable substrate
using robotics (Schena et al., 1995; Shalon et al.,
1996) or by several short oligonucleotides (20-30
bp) synthesized directly onto a solid support using
photolabile nucleotide chemistry (Fodor et al.,
1991; Chee et al., 1996). From control and treated
tissues, total RNA or mRNA is isolated and
reverse transcribed in the presence of radioactive
or fluorescent labeled nucleotides, and the labeled
probes are then hybridized to the arrays. The
intensity of the array signal is measured for each
gene transcript by either autoradiography or laser
scanning confocal microscopy. The ratio between
the signals of control and treated samples reflects
the relative drug-induced change in transcript
abundance.
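
The ratio calculation described above is often reported on a log2 scale so that induction and repression are symmetric around zero. The following is a minimal sketch under stated assumptions: the function name, the gene labels, the signal values and the `floor` parameter (a lower bound that guards against division by zero for very weak spots) are all hypothetical and are not taken from any of the cited array platforms.

```python
import math


def log2_ratio(control_signal, treated_signal, floor=1.0):
    """Return log2(treated / control) for one array element.

    Signals below `floor` are raised to `floor` (an assumed safeguard).
    """
    control = max(control_signal, floor)
    treated = max(treated_signal, floor)
    return math.log2(treated / control)


# Hypothetical (control, treated) signal intensities per gene.
signals = {
    "gene_a": (200.0, 800.0),   # four-fold induction
    "gene_b": (500.0, 500.0),   # unchanged
    "gene_c": (400.0, 100.0),   # four-fold repression
}
ratios = {gene: log2_ratio(c, t) for gene, (c, t) in signals.items()}
```

On this scale a value of 2 means a four-fold increase in the treated sample and a value of -2 a four-fold decrease.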



3. Global protein profiling 

Global quantitative expression analysis at the
protein level is currently restricted to the use of
two-dimensional gel electrophoresis. This tech-
nique combines separation of tissue proteins by
isoelectric focusing in the first dimension and by
sodium dodecyl sulfate slab gel electrophoresis-
based molecular weight separation in the second,
orthogonal dimension (Anderson et al., 1991).
The product is a rectangular pattern of protein
spots that are typically revealed by Coomassie
Blue, silver or fluorescent staining (Fig. 2).
Protein spots are identified by mass spectrometry
following generation of peptide mass fingerprints
(Mann et al., 1993) and sequence tags (Wilkins et
al., 1996). Similar to the mRNA approach, the
ratios between the optical densities of spots from
control and treated samples are compared to
search for treatment-related changes.



4. Expression data analysis 

Bioinformatics forms a key element required to
organize, analyze and store expression data from 
either source, the mRNA or the protein level. The 
overall objective, once a mass of high-quality 
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(DNA arrays. AFLP. RT1, SAGE ) 



Fig. 1. Production of an active protein is a multistep process in which numerous regulation systems exert control at various stages
of expression. Molecular fingerprints of drugs can be visualized through expression profiling at the mRNA level (genomics) using
a variety of technologies and at the protein level (proteomics) using two-dimensional gel electrophoresis.
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quantitative expression daia has been collected, is 
to visualize complex patterns of gene expression
changes, to detect pathways and sets of genes
tightly correlated with treatment efficacy and toxi-
city, and to compare the effects of different sets of
treatment (Anderson et al., 1996). As the drug
effect database is growing, one may detect similar-
ities and differences between the molecular finger-
prints produced by various drugs, information
that may be crucial to make a decision whether to
refocus or extend the therapeutic spectrum of a
drug candidate.



5. Comparison of global mRNA and protein 
expression profiling 

There are several synergies and overlaps of data
obtained by mRNA and protein expression analy-
sis. Low abundant transcripts may not be easily
quantified at the protein level using standard two-
dimensional gel electrophoresis analysis and their
detection may require prefractionation of sam-
ples. The expression of such genes may be prefer-
ably quantified at the mRNA level using
techniques allowing PCR-mediated target amplifi-
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cation. Tissue biopsy samples typically yield good 
quality of both mRNA and proteins; however, the
quality of mRNA isolated from body fluids is
often poor due to the faster degradation of
mRNA when compared with proteins. RNA sam-
ples from body fluids such as serum or urine are
often not very 'meaningful', and secreted proteins
are likely more reliable surrogate markers for
treatment efficacy and safety. Detection of post-
translational modifications, events often related to
function or nonfunction of a protein, is restricted
to protein expression analysis and rarely can be
predicted by mRNA profiling. Information on
subcellular localization and translocation of
proteins has to be acquired at the level of the
protein in combination with sample prefractiona-
tion procedures. The growing evidence of a poor
correlation between mRNA and protein abun-
dance (Anderson and Seilhamer, 1997) further
suggests that the two approaches, mRNA and
protein profiling, are complementary and should
be applied in parallel.
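
How well mRNA and protein abundance agree for a set of genes can be summarised with a Pearson correlation coefficient. The sketch below is illustrative only: the function is a plain textbook implementation, and the paired abundance values are invented for the example rather than taken from any of the cited studies.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / sqrt(var_x * var_y)


# Hypothetical relative abundances for the same five genes.
mrna = [1.0, 2.0, 3.0, 4.0, 5.0]
protein = [2.1, 1.9, 3.5, 3.0, 4.8]
r = pearson_r(mrna, protein)
```

A coefficient well below 1 for such paired measurements is the kind of result that argues for running both profiling approaches in parallel.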



6. Expression profiling and drug development 

Understanding the mechanisms of action and
toxicity, and being able to monitor treatment
efficacy and safety during trials is crucial for the
successful development of a drug. Mechanistic
insights are essential for the interpretation of drug
effects and enhance the chances of recognizing
potential species specificities contributing to an
improved risk profile in humans (Richardson et
al., 199?; Steiner et al., 1996b; Aicher et al., 1998).
The value of expression profiling further increases
when links between treatment-induced expression
profiles and specific pharmacological and toxic
endpoints are established (Anderson et al., 1991,
1995, 1996; Steiner et al., 1996a). Changes in gene
expression are known to precede the manifesta-
tion of morphological alterations, giving expres-
sion profiling a great potential for early
compound screening, enabling one to select drug
candidates with wide therapeutic windows
reflected by molecular fingerprints indicative of
high pharmacological potency and low toxicity
(Arce et al., 1998). In later phases of drug devel-
opment, surrogate markers of treatment efficacy
and toxicity can be applied to optimize the moni-
toring of pre-clinical and clinical studies (Doherty
et al.).



7. Perspectives 

The basic methodology of safety evaluation has
changed little during the past decades. Toxicity in
laboratory animals has been evaluated primarily
by using hematological, clinical chemistry and
histological parameters as indicators of organ
damage. The rapid progress in genomics and pro-
teomics technologies creates a unique opportunity
to dramatically improve the predictive power of
safety assessment and to accelerate the drug devel-
opment process. Application of gene and protein
expression profiling promises to improve lead se-
lection, resulting in the development of drug can-
didates with higher efficacy and lower toxicity.
The identification of biologically relevant surro-
gate markers correlated with treatment efficacy
and safety bears a great potential to optimize the
monitoring of pre-clinical and clinical trials.
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From: "Afshari.Cvnthia'* <afshari<S'niehs.nih.gov> 
To: "Diana Hamlet-Cox" <dianahc@incyte.com>

You can see the list of clones that we have on our 12K chip at
•l.r_. ■ -=-••«- - ; p- s .-ir. . sov maps • cuest ' c.or.esrcr. .err. 

.. ^ •"" " £ genes (2000K) thai we believed critics, tc to:-. 

W6 c^: ' -d b si ce ? lu!!; processes and added a set of clor.es ar.d £-"s 
:!?r We hive "nciuaed a se! of control genes (80-, that were se ecte= sy 
^Iotcs" be-a-se they die not change across a large set or arra\ 
experiments. However, we have found that some of these genes change
significantly after tox treatments and are in the process of looking
at each of these 80+ genes across our experiments.
Our chips are constantly changing and being updated and we hope that our
data will lead us to what the toxchip should really be.
I hope this answers your question.
Cindy Afshari 



> From: Diana Hamlet-Cox

> Sent: Monday, June 26, 2000 8:52 PM
> To: afshari@niehs.nih.gov
> Subject: [Fwd: Toxicology Chip]



> 
> 

> 

> Dear Dr. Afshari 



> Since I have not yet had a response from Bill Grigg, perhaps he was not 

> the right person to contact. 

> Can you help me in this matter? I don't need to know the sequences,

> necessarily, but I would like very much to know what types of sequences

> are being used, e.g., GPCRs (more specific?), ion channels, etc.

> 

> Diana Hamlet-Cox 



> 



Original Message 

> Subject: Toxicology Chip 

> Date: Mon, 19 Jun 2000 18:31:48 -0700

> From: Diana Hamlet-Cox <dianahc@incyte.com>

> Organization: Incyte Pharmaceuticals 

> To: grigg@niehs.nih.gov 
> 

> Dear Colleague: 



I am doing literature research on the use of expressed genes as
oha^ — ^nd the Press Release ^JJJ^ o 

29, 2000 regarding the work of the NIEHS in this area. I would like to
know if there is a resource I can access (or you could provide?) that
would give me a list of the 12,000 genes that are on your Human ToxChip
Microarray. In particular, I am interested in the criteria used to
select sequences for the ToxChip, including any control sequences.



Thank you for your assistance in this request. 



> Diana Hamlet-Cox, Ph.D. 

> Incyte Genomics, Inc. 
> 

> 
> 
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> original message. 
ABSTRACT Pairwise sequence comparison methods have
been assessed using proteins whose relationships are known
reliably from their structures and functions, as described in
the SCOP database [Murzin, A. G., Brenner, S. E., Hubbard, T.
& Chothia, C. (1995) J. Mol. Biol. 247, 536-540]. The evalua-
tion tested the programs BLAST [Altschul, S. F., Gish, W.,
Miller, W., Myers, E. W. & Lipman, D. J. (1990) J. Mol. Biol.
215, 403-410], WU-BLAST2 [Altschul, S. F. & Gish, W. (1996)
Methods Enzymol. 266, 460-480], FASTA [Pearson, W. R. &
Lipman, D. J. (1988) Proc. Natl. Acad. Sci. USA 85, 2444-2448],
and SSEARCH [Smith, T. F. & Waterman, M. S. (1981) J. Mol.
Biol. 147, 195-197] and their scoring schemes. The error rate
of all algorithms is greatly reduced by using statistical scores 
to evaluate matches rather than percentage identity or raw 
scores. The E-value statistical scores of SSEARCH and FASTA are 
reliable: the number of false positives found in our tests agrees 
well with the scores reported. However, the P-values reported 
by BLAST and WU-BLAST2 exaggerate significance by orders of
magnitude. SSEARCH, FASTA ktup = 1, and WU-BLAST2 perform
best, and they are capable of detecting almost all relationships 
between proteins whose sequence identities are >30%. For 
more distantly related proteins, they do much less well; only 
one-half of the relationships between proteins with 20-30% 
identity are found. Because many homologs have low sequence 
similarity, most distant relationships cannot be detected by 
any pairwise comparison method; however, those which are 
identified may be used with confidence. 



Sequence database searching plays a role in virtually every 
branch of molecular biology and is crucial for interpreting the 
sequences issuing forth from genome projects. Given the 
method's central role, it is surprising that overall and relative 
capabilities of different procedures are largely unknown. It is 
difficult to verify algorithms on sample data because this 
requires large data sets of proteins whose evolutionary rela- 
tionships are known unambiguously and independently of the 
methods being evaluated. However, nearly all known ho- 
mologs have been identified by sequence analysis (the method 
to be tested). Also, it is generally very difficult to know, in the 
absence of structural data, whether two proteins that lack clear 
sequence similarity are unrelated. This has meant that al-
though previous evaluations have helped improve sequence
comparison, they have suffered from insufficient, imperfectly 
characterized, or artificial test data. Assessment also has been 
problematic because high quality database sequence searching 
attempts to have both sensitivity (detection of homologs) and 
specificity (rejection of unrelated proteins); however, these 
complementary goals are linked such that increasing one 
causes the other to be reduced. 
Sequence comparison methodologies have evolved rapidly,
so no previously published test has evaluated modern versions
of programs commonly used. For example, parameters in
BLAST (1) have changed, and WU-BLAST2 (2), which produces
gapped alignments, has become available. The latest version
of FASTA (3) previously tested was 1.6, but the current release
(version 3.0) provides fundamentally different results in the 
form of statistical scoring. 

The previous reports also have left gaps in our knowledge. 
For example, there has been no published assessment of 
thresholds for scoring schemes more sophisticated than per- 
centage identity. Thus, the widely discussed statistical scoring 
measures have never actually been evaluated on large data- 
bases of real proteins. Moreover, the different scoring schemes 
commonly in use have not been compared. 

Beyond these issues, there is a more fundamental question: 
in an absolute sense, how well does pairwise sequence com- 
parison work? That is, what fraction of homologous proteins 
can be detected using modern database searching methods? 

In this work, we attempt to answer these questions and to 
overcome both of the fundamental difficulties that have hin- 
dered assessment of sequence comparison methodologies. 
First, we use the set of distant evolutionary relationships in the 
SCOP: Structural Classification of Proteins database (4), which
is derived from structural and functional characteristics (5). 
The SCOP database provides a uniquely reliable set of ho- 
mologs, which are known independently of sequence compar- 
ison. Second, we use an assessment method that jointly mea- 
sures both sensitivity and specificity. This method allows 
straightforward comparison of different sequence searching 
procedures. Further, it can be used to aid interpretation of real 
database searches and thus provide optimal and reliable 
results. 

Previous Assessments of Sequence Comparison. Several 
previous studies have examined the relative performance of 
different sequence comparison methods. The most encom- 
passing analyses have been by Pearson (6, 7), who compared 
the three most commonly used programs. Of these, the Smith-
Waterman algorithm (8) implemented in SSEARCH (3) is the
oldest and slowest but the most rigorous. Modern heuristics
have provided BLAST (1) the speed and convenience to make
it the most popular program. Intermediate between these two
is FASTA (3), which may be run in two modes offering either
greater speed (ktup = 2) or greater effectiveness (ktup = 1). 
Pearson also considered different parameters for each of these 
programs. 

To test the methods, Pearson selected two representative 
proteins from each of 67 protein superfamilies defined by the 
PIR database (9). Each was used as a query to search the
database, and the matched proteins were marked as being 
homologous or unrelated according to their membership of PIR 
superfamilies. Pearson found that modern matrices and "ln-
scaling" of raw scores improve results considerably. He also
reported that the rigorous Smith-Waterman algorithm worked
slightly better than FASTA, which was in turn more effective
than BLAST.

Very large scale analyses of matrices have been performed 
(10), and Henikoff and Henikoff (11) also evaluated the 
effectiveness of BLAST and FASTA. Their test with BLAST
considered the ability to detect homologs above a predeter- 
mined score but had no penalty for methods which also 
reported large numbers of spurious matches. The Henikoffs 
searched the SWISS-PROT database (12) and used PROSITE (13)
to define homologous families. Their results showed that the 
BLOSUM62 matrix (14) performed markedly better than the 
extrapolated PAM-series matrices (15), which previously had 
been popular. 

A crucial aspect of any assessment is the data that are used 
to test the ability of the program to find homologs. But in 
Pearson's and the Henikoffs' evaluations of sequence com- 
parison, the correct results were effectively unknown. This is 
because the superfamilies in PIR and PROSITE are principally
created by using the same sequence comparison methods 
which are being evaluated. Interdependency of data and 
methods creates a "chicken and egg" problem, and means for 
example, that new methods would be penalized for correctly 
identifying homologs missed by older programs. For instance, 
immunoglobulin variable and constant domains are clearly 
homologous, but PIR places them in different superfamilies.
The problem is widespread: each superfamily in PIR 48.00 with
a structural homolog is itself homologous to an average of 1.6
other PIR superfamilies (16).

To surmount these sorts of difficulties, Sander and Schnei- 
der (17) used protein structures to evaluate sequence com- 
parison. Rather than comparing different sequence compari- 
son algorithms, their work focused on determining a length- 
dependent threshold of percentage identity, above which all 
proteins would be of similar structure. A result of this analysis 
was the HSSP equation; it states that proteins with 25% identity
over 80 residues will have similar structures, whereas shorter
alignments require higher identity. (Other studies also have 
used structures (18-20), but these focused on a small number 
of model proteins and were principally oriented toward eval- 
uating alignment accuracy rather than homology detection.) 
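
As a rough illustration, the length-dependent threshold can be written as a small function. The sketch below uses the form of the HSSP equation quoted in the Fig. 1 legend (H = 290.15 l^-0.562 for alignment lengths between 10 and 80 residues, and H = 24.7 for longer alignments). The function names are ours, and the handling of the boundary lengths l = 10 and l = 80 is an assumption, because the legend leaves them unspecified.

```python
import math


def hssp_threshold(length: int) -> float:
    """Percentage identity threshold H for an alignment of `length` residues."""
    if length < 10:
        # The legend gives H > 100 here, so no short alignment can pass.
        return math.inf
    if length >= 80:
        # Assumption: l = 80 itself takes the plateau value.
        return 24.7
    # Assumption: l = 10 itself falls on the power-law branch.
    return 290.15 * length ** -0.562


def hssp_adjusted_identity(percent_identity: float, length: int) -> float:
    """Percent identity within the alignment minus H (positive = above the curve)."""
    return percent_identity - hssp_threshold(length)


if __name__ == "__main__":
    # The unrelated pair of Fig. 2: 39% identity over 64 residues.
    print(round(hssp_adjusted_identity(39.0, 64), 1))
```

Under this form of the equation, the unrelated pair shown in Fig. 2 lies roughly 11 percentage points above the curve, which illustrates why an identity threshold alone can admit errors.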

A general solution to the problem of scoring comes from 
statistical measures (i.e., E-values and P-values) based on the 
extreme value distribution (21). Extreme value scoring was 
implemented analytically in the BLAST program using the
Karlin and Altschul statistics (22, 23) and empirical ap-
proaches have been recently added to FASTA and SSEARCH. In
addition to being heralded as a reliable means of recognizing
significantly similar proteins (24, 25), the mathematical trac-
tability of statistical scores "is a crucial feature of the BLAST
algorithm" (1). The validity of this scoring procedure has been
tested analytically and empirically (see ref. 2 and references in 
ref. 24). However, all large empirical tests used random 
sequences that may lack the subtle structure found within 
biological sequences (26, 27) and obviously do not contain any 
real homologs. Thus, although many researchers have sug- 
gested that statistical scores be used to rank matches (24, 25, 
28), there have been no large rigorous experiments on biolog- 
ical data to determine the degree to which such rankings are 
superior. 

A Database for Testing Homology Detection. Since the 
discovery that the structures of hemoglobin and myoglobin are 
very similar though their sequences are not (29), it has been 
apparent that comparing structures is a more powerful (if less 
convenient) way to recognize distant evolutionary relation- 
ships than comparing sequences. If two proteins show a high 
degree of similarity in their structural details and function, it 
is very probable that they have an evolutionary relationship
though their sequence similarity may be low. 

The recent growth of protein structure information com-
bined with the comprehensive evolutionary classification in
the SCOP database (4, 5) has allowed us to overcome previous
limitations. With these data, we can evaluate the performance
of sequence comparison methods on real protein sequences
whose relationships are known confidently. The SCOP database
uses structural information to recognize distant homologs, the
large majority of which can be determined unambiguously. 
These superfamilies, such as the globins or the immunoglobu- 
lins, would be recognized as related by the vast majority of the 
biological community despite the lack of high sequence sim- 
ilarity. 

From SCOP, we extracted the sequences of domains of
proteins in the Protein Data Bank (PDB) (30) and created two
databases. One (PDB90D-B) has domains that were all <90%
identical to any other, whereas the other (PDB40D-B) had those <40%
identical. The databases were created by first sorting all
protein domains in SCOP by their quality and making a list. The 
highest quality domain was selected for inclusion in the 
database and removed from the list. Also removed from the list 
(and discarded) were all other domains above the threshold 
level of identity to the selected domain. This process was 
repeated until the list was empty. The PDB40D-B database 
contains 1,323 domains, which have 9,044 ordered pairs of
distant relationships, or approximately 0.5% of the total 1,749,006 ordered
pairs. In PDB90D-B, the 2,079 domains have 53,988 relation-
ships, representing 1.2% of all pairs. Low complexity regions
of sequence can achieve spurious high scores, so these were 
masked in both databases by processing with the SEG program
(27) using recommended parameters: 12 1.8 2.0. The databases
used in this paper are available from http://sss.stanford.edu/
sss/, and databases derived from the current version of SCOP
may be found at http://scop.mrc-lmb.cam.ac.uk/scop/. 
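
The selection procedure described above is a greedy filter, and a minimal sketch of it is given below. The function and parameter names are hypothetical, and the `identity` callable stands in for whatever pairwise identity measure was actually used, which the text does not specify. The helper `ordered_pairs` reproduces the pair counts quoted above.

```python
from typing import Callable, List, Sequence


def select_nonredundant(
    domains_by_quality: Sequence[str],
    identity: Callable[[str, str], float],
    threshold: float,
) -> List[str]:
    """Greedy filter: keep the best remaining domain, discard its near-duplicates.

    `domains_by_quality` must already be sorted from highest to lowest quality.
    """
    remaining = list(domains_by_quality)
    selected: List[str] = []
    while remaining:
        best = remaining.pop(0)          # highest quality domain left on the list
        selected.append(best)
        # Keep only domains below the identity threshold to the one just chosen.
        remaining = [d for d in remaining if identity(best, d) < threshold]
    return selected


def ordered_pairs(n: int) -> int:
    """Number of ordered (query, target) pairs among n distinct domains."""
    return n * (n - 1)


if __name__ == "__main__":
    print(ordered_pairs(1323))                     # pairs in PDB40D-B
    print(round(100 * 9044 / ordered_pairs(1323), 2))   # percent that are homologs
```

Because each selected domain removes everything too similar to it, the result depends on the quality ordering; a different ordering could yield a different, equally valid, nonredundant set.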

Analyses from both databases were generally consistent, but 
PDB40D-B focuses on distantly related proteins and reduces the 
heavy overrepresentation in the PDB of a small number of 
families (31, 32), whereas PDB90D-B (with more sequences) 
improves evaluations of statistics. Except where noted other- 
wise, the distant homolog results here are from PDB40D-B. 
Although the precise numbers reported here are specific to the 
structural domain databases used, we expect the trends to be 
general. 

Assessment Data and Procedure. Our assessment of se- 
quence comparison may be divided into four different major 
categories of tests. First, using just a single sequence compar- 
ison algorithm at a time, we evaluated the effectiveness of 
different scoring schemes. Second, we assessed the reliability 
of scoring procedures, including an evaluation of the validity 
of statistical scoring. Third, we compared sequence compari- 
son algorithms (using the optimal scoring scheme) to deter- 
mine their relative performance. Fourth, we examined the 
distribution of homologs and considered the power of pairwise 
sequence comparison to recognize them. All of the analyses 
used the databases of structurally identified homologs and a 
new assessment criterion. 

The analyses tested BLAST (1), version 1.4.9MP, and WU-
BLAST2 (2), version 2.0a13MP. Also assessed was the FASTA
package, version 3.0t76 (3), which provided FASTA and the
SSEARCH implementation of Smith-Waterman (8). For
SSEARCH and FASTA, we used BLOSUM45 with gap penalties
-12/-1 (7, 16). The default parameters and matrix (BLO-
SUM62) were used for BLAST and WU-BLAST2.

The "Coverage vs. Error" Plot. To test a particular protocol
(comprising a program and scoring scheme), each sequence 
from the database was used as a query to search the database. 
This yielded ordered pairs of query and target sequences with 
associated scores, which were sorted, on the basis of their 
scores, from best to worst. The ideal method would have 



Biochemistry: Brenner et al.    Proc. Natl. Acad. Sci. USA 95 (1998) 6075

[Fig. 1 plots. Panel titles: Smith-Waterman Scoring Schemes (PDB40D-B); Smith-Waterman Scoring Schemes (PDB90D-B). x axis: Coverage.]



Fig. 1. Coverage vs. error plots of different scoring schemes for SSEARCH Smith-Waterman. (A) Analysis of PDB40D-B database. (B) Analysis
of PDB90D-B database. All of the proteins in the database were compared with each other using the SSEARCH program. The results of this single
set of comparisons were considered using five different scoring schemes and assessed. The graphs show the coverage and errors per query (EPQ)
for statistical scores, raw scores, and three measures using percentage identity. In the coverage vs. error plot, the x axis indicates the fraction of
all homologs in the database (known from structure) which have been detected. Precisely, it is the number of detected pairs of proteins with the
same fold divided by the total number of pairs from a common superfamily. PDB40D-B contains a total of 9,044 homologs, so a score of 10% indicates
identification of 904 relationships. The y axis reports the number of EPQ. Because there are 1,323 queries made in the PDB40D-B all-vs.-all
comparison, 13 errors correspond to 0.01, or 1% EPQ. The y axis is presented on a log scale to show results over the widely varying degrees of
accuracy which may be desired. The scores that correspond to the levels of EPQ and coverage are shown in Fig. 4 and Table 1. The graph
demonstrates the trade-off between sensitivity and selectivity. As more homologs are found (moving to the right), more errors are made (moving
up). The ideal method would be in the lower right corner of the graph, which corresponds to identifying many evolutionary relationships without
selecting unrelated proteins. Three measures of percentage identity are plotted. Percentage identity within alignment is the degree of identity within
the aligned region of the proteins, without consideration of the alignment length. Percentage identity within both is the number of identical residues
in the aligned region as a percentage of the average length of the query and target proteins. The HSSP equation (17) is $H = 290.15\,l^{-0.562}$, where
$l$ is length, for $10 < l < 80$; $H > 100$ for $l < 10$; $H = 24.7$ for $l > 80$. The percentage identity HSSP-adjusted score is the percent identity within
the alignment minus $H$. Smith-Waterman raw scores and E-values were taken directly from the sequence comparison program.



perfect separation, with all of the homologs at the top of the 
list and unrelated proteins below. In practice, perfect separa- 
tion is impossible to achieve so instead one is interested in 
drawing a threshold above which there are the largest number 
of related pairs of sequences consistent with an acceptable 
error rate. 

Our procedure involved measuring the coverage and error 
for every threshold. Coverage was defined as the fraction of 
structurally determined homologs that have scores above the 
selected threshold; this reflects the sensitivity of a method. 
Errors per query (EPQ), an indicator of selectivity, is the 
number of nonhomologous pairs above the threshold divided 
by the number of queries. Graphs of these data, called 
coverage vs. error plots, were devised to understand how 



protocols compare at different levels of accuracy. These 
graphs share effectively all of the beneficial features of Re-
ceiver Operating Characteristic (ROC) plots (33, 34) but
better represent the high degrees of accuracy required in 
sequence comparison and the huge background of nonho- 
mologs. 
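
To make the two quantities concrete, the following sketch computes coverage and EPQ at every threshold from a list of scored pairs. It is only an illustration of the definitions above: the function names and the input layout are our own assumptions, higher scores are taken to be better (E-values would need to be negated first), and tied scores are not grouped into a single threshold.

```python
from typing import Iterable, List, Tuple


def coverage_vs_error(
    hits: Iterable[Tuple[float, bool]],
    total_homologs: int,
    num_queries: int,
) -> List[Tuple[float, float, float]]:
    """Return (score, coverage, errors per query) as the threshold is lowered.

    `hits` holds one (score, is_homolog) entry per ordered query/target pair,
    where is_homolog comes from the structural classification.
    """
    points = []
    true_pairs = 0
    false_pairs = 0
    for score, is_homolog in sorted(hits, key=lambda h: h[0], reverse=True):
        if is_homolog:
            true_pairs += 1
        else:
            false_pairs += 1
        points.append((score, true_pairs / total_homologs, false_pairs / num_queries))
    return points


def coverage_at_epq(points: List[Tuple[float, float, float]], max_epq: float) -> float:
    """Largest coverage reached while EPQ stays at or below `max_epq`."""
    return max((cov for _, cov, epq in points if epq <= max_epq), default=0.0)


if __name__ == "__main__":
    toy_hits = [(50.0, True), (40.0, True), (30.0, False), (20.0, True)]
    curve = coverage_vs_error(toy_hits, total_homologs=4, num_queries=2)
    print(coverage_at_epq(curve, 0.0), coverage_at_epq(curve, 0.5))
```

Plotting the coverage values against EPQ on a logarithmic y axis gives the kind of curve shown in Fig. 1; reading off the coverage at 1% EPQ corresponds to calling `coverage_at_epq(curve, 0.01)`.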

This assessment procedure is directly relevant to practical 
sequence database searching, for it provides precisely the 
information necessary to perform a reliable sequence database 
search. The EPQ measure places a premium on score consis- 
tency; that is, it requires scores to be comparable for different 
queries. Consistency is an aspect which has been largely 
Fig. 2. Unrelated proteins with high percentage identity. Hemo-
globin β-chain (PDB code 1hds chain b, ref. 38, Left) and cellulase E2
(PDB code 1tml, ref. 39, Right) have 39% identity over 64 residues, a
level which is often believed to be indicative of homology. Despite this
high degree of identity, their structures strongly suggest that these
proteins are not related. Appropriately, neither the raw alignment
score of 85 nor the E-value of 1.3 is significant. Proteins rendered by
RASMOL (40).



Each point plots the length and
percent identity of an alignment
between two unrelated proteins.
Fig. 3. Length and percentage identity of alignments of unrelated
proteins in PDB90D-B: Each pair of nonhomologous proteins found with
SSEARCH is plotted as a point whose position indicates the length and
the percentage identity within the alignment. Because alignment
length and percentage identity are quantized, many pairs of proteins
may have exactly the same alignment length and percentage identity.
The line shows the HSSP threshold (though it is intended to be applied
with a different matrix and parameters).
Fig. 4. Reliability of statistical scores in PDB90D-B: Each line shows
the relationship between reported statistical score and actual error
rate for a different program. E-values are reported for SSEARCH and
FASTA, whereas P-values are shown for BLAST and WU-BLAST2. If the
scoring were perfect, then the number of errors per query and the
E-values would be the same, as indicated by the upper bold line.
(P-values should be the same as EPQ for small numbers, and diverge
at higher values, as indicated by the lower bold line.) E-values from
SSEARCH and FASTA are shown to have good agreement with EPQ but
underestimate the significance slightly. BLAST and WU-BLAST2 are
overconfident, with the degree of exaggeration dependent upon the
score. The results for PDB40D-B were similar to those for PDB90D-B
despite the difference in number of homologs detected. This graph
could be used to roughly calibrate the reliability of a given statistical
score.

ignored in previous tests but is essential for the straightforward 
or automatic interpretation of sequence comparison results. 
Further, it provides a clear indication of the confidence that 
should be ascribed to each match. Indeed, the EPQ measure 
should approximate the expectation value reported by data-
base searching programs, if the programs' estimates are accu-
rate.

The Performance of Scoring Schemes. All of the programs 
tested could provide three fundamental types of scores. The 
first score is the percentage identity, which may be computed 
in several ways based on either the length of the alignment or 
the lengths of the sequences. The second is a "raw" or 
"Smith-Waterman" score, which is the measure optimized by 
the Smith-Waterman algorithm and is computed by summing 
the substitution matrix scores for each position in the align-
ment and subtracting gap penalties. In BLAST, a measure
related to this score is scaled into bits. Third is a statistical
score based on the extreme value distribution. These results 
are summarized in Fig. 1. 
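
The first two kinds of score can be illustrated on a single gapped alignment. The sketch below is not the code of any of the programs tested: the toy substitution function is a hypothetical stand-in for a real matrix such as BLOSUM45, and the gap cost convention (a gap of k residues costs `gap_open + (k - 1) * gap_extend`) is an assumption made for the example. The two identity measures follow the definitions in the Fig. 1 legend, with the alignment length taken to include gap columns, which is also an assumption.

```python
def toy_substitution(x: str, y: str) -> int:
    """Hypothetical scores: +5 for identical residues, -2 otherwise."""
    return 5 if x == y else -2


def raw_score(aligned_a: str, aligned_b: str, substitution=toy_substitution,
              gap_open: int = 12, gap_extend: int = 1) -> int:
    """Sum substitution scores over aligned columns and subtract gap penalties."""
    score = 0
    in_gap = False
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            # First gap column pays gap_open, later consecutive ones gap_extend.
            score -= gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += substitution(x, y)
            in_gap = False
    return score


def identical_positions(aligned_a: str, aligned_b: str) -> int:
    return sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")


def identity_within_alignment(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over the aligned region only."""
    return 100.0 * identical_positions(aligned_a, aligned_b) / len(aligned_a)


def identity_within_both(aligned_a: str, aligned_b: str,
                         query_len: int, target_len: int) -> float:
    """Identical residues as a percentage of the mean full sequence length."""
    mean_len = (query_len + target_len) / 2.0
    return 100.0 * identical_positions(aligned_a, aligned_b) / mean_len


if __name__ == "__main__":
    a, b = "ACD-F", "ACEGF"
    print(raw_score(a, b), identity_within_alignment(a, b),
          identity_within_both(a, b, query_len=10, target_len=20))
```

The example shows how a short local alignment can have a high identity within the alignment while representing only a small fraction of the full sequences, which is the distinction between the first two identity measures.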

Sequence Identity. Though it has been long established that 
percentage identity is a poor measure (35), there is a common 
rule-of-thumb stating that 30% identity signifies homology. 
Moreover, publications have indicated that 25% identity can 
be used as a threshold (17, 36). We find that these thresholds, 
originally derived years ago, are not supported by present 
results. As databases have grown, so have the possibilities for 
chance alignments with high identity: thus, the reported cutoffs 
lead to frequent errors. Fig. 2 shows one of the many pairs of 
proteins with very different structures that nonetheless have 
high levels of identity over considerable aligned regions. 
Despite the high identity, the raw and the statistical scores for 
such incorrect matches are typically not significant. The prin- 
cipal reasons percentage identity does so poorly seem to be 
that it ignores information about gaps and about the conser- 
vative or radical nature of residue substitutions. 

From the PDB90D-B analysis in Fig. 3, we learn that 30%
identity is a reliable threshold for this database only for
sequence alignments of at least 150 residues. Because one
unrelated pair of proteins has 43.5% identity over 62 residues,
it is probably necessary for alignments to be at least 70 residues 
in length before 40% is a reasonable threshold, for a database 
of this particular size and composition. 

At a given reliability, scores based on percentage identity 
detect just a fraction of the distant homologs found by 
statistical scoring. If one measures the percentage identity in 
the aligned regions without consideration of alignment length, 
then a negligible number of distant homologs are detected. 
Use of the HSSP equation improves the value of percentage 
identity, but even this measure can find only 4% of all known 
homologs at 1% EPQ. In short, percentage identity discards 
most of the information measured in a sequence comparison. 

Raw Scores. Smith-Waterman raw scores perform better 
than percentage identity (Fig. 1), but ln-scaling (7) provided no 
notable benefit in our analysis. It is necessary to be very precise 
when using either raw or bit scores because a 20% change in 
cutoff score could yield a tenfold difference in EPQ. However, 
it is difficult to choose appropriate thresholds because the 
reliability of a bit score depends on the lengths of the proteins 
matched and the size of the database. Raw score thresholds 
also are affected by matrix and gap parameters. 

Statistical Scores. Statistical scores were introduced partly 
to overcome the problems that arise from raw scores. This 
scoring scheme provides the best discrimination between 
homologous proteins and those which are unrelated. Most 
Fig. 5. Coverage vs. error plots of different sequence comparison methods: Five different sequence comparison methods are evaluated, each
using statistical scores (E- or P-values). (A) PDB40D-B database. In this analysis, the best method is the slow SSEARCH, which finds 18% of relationships
at 1% EPQ. FASTA ktup = 1 and WU-BLAST2 are almost as good. (B) PDB90D-B database. The quick WU-BLAST2 program provides the best coverage
at 1% EPQ on this database, although at higher levels of error it becomes slightly worse than FASTA ktup = 1 and SSEARCH.
likely, its power can be attributed to its incorporation of more
information than any other measure; it takes account of the 
full substitution and gap data (like raw scores) but also has 
details about the sequence lengths and composition and is 
scaled appropriately. 

We find that statistical scores are not only powerful, but also
easy to interpret. SSEARCH and FASTA show close agreement
between statistical scores and actual number of errors per
query (Fig. 4). The expectation value score gives a good,
slightly conservative estimate of the chances of the two se-
quences being found at random in a given query. Thus, an
E-value of 0.01 indicates that roughly one pair of nonhomologs
of this similarity should be found in every 100 different queries.
Neither raw scores nor percentage identity can be interpreted 
in this way, and these results validate the suitability of the 
extreme value distribution for describing the scores from a 
database search. 

The P-values from BLAST also should be directly interpret-
able but were found to overstate significance by more than two
orders of magnitude for 1% EPQ for this database. Nonethe-
less, these results strongly suggest that the analytic theory is
fundamentally appropriate. WU-BLAST2 scores were more re-
liable than those from BLAST, but also exaggerate expected
confidence by more than an order of magnitude at 1% EPQ.
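
For readers comparing the two kinds of statistical score, the short sketch below converts between them. It relies on the standard Poisson relation P = 1 - exp(-E), which is not stated explicitly in this paper and is included here as an assumption about how the programs' P-values and E-values relate; the function names are ours. It also shows the reading of an E-value cutoff as an expected number of errors over many queries.

```python
import math


def p_from_e(e_value: float) -> float:
    """Probability of at least one chance match, given the expected number E."""
    return -math.expm1(-e_value)      # 1 - exp(-E), accurate for small E


def e_from_p(p_value: float) -> float:
    """Inverse conversion; valid for 0 <= P < 1."""
    return -math.log1p(-p_value)      # -ln(1 - P)


def expected_errors(e_value_cutoff: float, num_queries: int) -> float:
    """Expected number of nonhomologous matches at the cutoff over many queries."""
    return e_value_cutoff * num_queries


if __name__ == "__main__":
    for e in (0.001, 0.01, 0.1, 1.0, 3.0):
        print(e, round(p_from_e(e), 5))
    print(expected_errors(0.01, 100))
```

For small values the two scores are nearly equal, while for larger values the P-value saturates toward 1 and the E-value keeps growing, which is the divergence noted in the Fig. 4 legend.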

Overall Detection of Homologs and Comparison of Algo- 
rithms. The results in Fig. 5A and Table 1 show that pairwise 
sequence comparison is capable of identifying only a small 
fraction of the homologous pairs of sequences in PDB40D-B. 
Even SSEARCH with E-values, the best protocol tested, could
find only 18% of all relationships at a 1% EPQ. BLAST, which
identifies 15%, was the worst performer, whereas FASTA
ktup = 1 is nearly as effective as SSEARCH. FASTA ktup = 2 and
WU-BLAST2 are intermediate in their ability to detect ho-
mologs. Comparison of different algorithms indicates that
those capable of identifying more homologs are generally
slower. SSEARCH is 25 times slower than BLAST and 6.5 times
slower than FASTA ktup = 1. WU-BLAST2 is slightly faster than
FASTA ktup = 2, but the latter has more interpretable scores.

In PDB90D-B, where there are many close relationships, the
best method can identify only 38% of structurally known
homologs (Fig. 5B). The method which finds that many
relationships is WU-BLAST2. Consequently, we infer that the
differences between FASTA ktup = 1, SSEARCH, and WU-BLAST2
programs are unlikely to be significant when compared with
variation in database composition and scoring reliability.

Fig. 6 helps to explain why most distant homologs cannot be 
found by sequence comparison: a great many such relation- 
ships have no more sequence identity than would be expected 
by chance. SSEARCH with E-values can recognize >90% of the
homologous pairs with 30-40% identity. In this region, there 
are 30 pairs of homologous proteins that do not have signif- 
icant E-values, but 26 of these involve sequences with <50 
residues. Of sequences having 25-30% identity, 75% are 
identified by ssearch E-values. However, although the num- 
ber of homologs grows at lower levels of identity, the detection 
falls off sharply: only 40% of homologs with 20-25% identity 
Fig. 6. Distribution and detection of homologs in PDB40D-B. Bars
show the distribution of homologous pairs in PDB40D-B according to their
identity (using the measure of identity in both). Filled regions indicate
the number of these pairs found by the best database searching method
(SSEARCH with E-values) at 1% EPQ. The PDB40D-B database contains
proteins with <40% identity, and as shown on this graph, most
structurally identified homologs in the database have diverged ex-
tremely far in sequence and have <20% identity. Note that the
alignments may be inaccurate, especially at low levels of identity. Filled
regions show that SSEARCH can identify most relationships that have
25% or more identity, but its detection wanes sharply below 25%.
Consequently, the great sequence divergence of most structurally
identified evolutionary relationships effectively defeats the ability of
pairwise sequence comparison to detect them.

are detected and only 10% of those with 15-20% can be found. 
These results show that statistical scores can find related 
proteins whose identity is remarkably low; however, the power 
of the method is restricted by the great divergence of many 
protein sequences. 

After completion of this work, a new version of pairwise
BLAST was released: BLASTGP (37). It supports gapped align-
ments, like WU-BLAST2, and dispenses with sum statistics. Our
initial tests on BLASTGP using default parameters show that its
E-values are reliable and that its overall detection of homologs
was substantially better than that of ungapped BLAST, but not
quite equal to that of WU-BLAST2.

CONCLUSION 

The general consensus amongst experts (see refs. 7, 24, 25, 27 
and references therein) suggests that the most effective se-
quence searches are made by (i) using a large current database
in which the protein sequences have been complexity masked
and (ii) using statistical scores to interpret the results. Our
experiments fully support this view. 

Our results also suggest two further points. First, the E-val-
ues reported by FASTA and SSEARCH give fairly accurate
estimates of the significance of each match, but the P-values
provided by BLAST and WU-BLAST2 underestimate the true



Table 1. Summary of sequence comparison methods with PDB40D-B

Method                                   Relative Time*   1% EPQ Cutoff       Coverage at 1% EPQ (%)
SSEARCH % identity: within alignment     25.5             >70%                <0.1
SSEARCH % identity: within both          25.5             34%                 3.0
SSEARCH % identity: HSSP-scaled          25.5             35% (HSSP + 9.8)    4.0
SSEARCH Smith-Waterman raw scores        25.5             142                 10.5
SSEARCH E-values                         25.5             0.03                18.4
FASTA ktup = 1 E-values                  3.9              0.03                17.9
FASTA ktup = 2 E-values                  1.4              0.03                16.7
WU-BLAST2 P-values                       1.1              0.003               17.5
BLAST P-values                           1.0              0.00016             14.8

*Times are from large database searches with genome proteins.
extent of errors. Second, SSEARCH, WU-BLAST2, and FASTA
ktup = 1 perform best, though BLAST and FASTA ktup = 2
detect most of the relationships found by the best procedures
and are appropriate for rapid initial searches.

The homologous proteins that are found by sequence com- 
parison can be distinguished with high reliability from the huge 
number of unrelated pairs. However, even the best database 
searching procedures tested fail to find the large majority of 
distant evolutionary relationships at an acceptable error rate. 
Thus, if the procedures assessed here fail to find a reliable 
match, it does not imply that the sequence is unique; rather, it 
indicates that any relatives it might have are distant ones.** 



** Additional and updated information about this work, including 
supplementary figures, may be found at http://sss.stanford.edu/sss/. 